Additional Clarity on Size Field

seanherron commented 11 years ago

The size field, which is intended to represent the file size of the resource, has many potential areas for improvement.

First off, we should either decide on a standard unit of measurement (like dcat:byteSize) or break out the unit of measurement from the numeric value (e.g. two fields: size for the numeric value and sizeUnit for the unit of measurement). This would allow for machine reading of the value, enabling users to sort and filter by the size of the resource, as well as reducing confusion when multiple standards of measurement are used.
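A minimal sketch of the two-field option, in Python, to show what machine reading would buy us. The field names `size` and `sizeUnit` come from the proposal above; the unit table, the decimal (powers of 1000) interpretation, and the sample records are assumptions for illustration, not part of the schema.

```python
# Hypothetical unit table; assumes decimal (SI) multiples, which the schema does not specify.
UNIT_FACTORS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def to_bytes(size, size_unit):
    """Convert a numeric size plus a unit label into a whole number of bytes."""
    return int(round(size * UNIT_FACTORS[size_unit.upper()]))


# Made-up catalog entries using the proposed size / sizeUnit pair.
records = [
    {"title": "A", "size": 1.5, "sizeUnit": "GB"},
    {"title": "B", "size": 40, "sizeUnit": "MB"},
    {"title": "C", "size": 900, "sizeUnit": "KB"},
]

# Once every entry is comparable in bytes, sorting and filtering are straightforward.
by_size = sorted(records, key=lambda r: to_bytes(r["size"], r["sizeUnit"]))
under_100_mb = [
    r["title"] for r in by_size if to_bytes(r["size"], r["sizeUnit"]) < 100 * 10**6
]
```

If we standardized on bytes instead, the `to_bytes` step would disappear and the sort key would simply be the stored value.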

Secondly, what is the rationale behind the cardinality enabling multiple values? Is this in relation to the possibility of having multiple accessURLs? If so, how do we draw a link between a specific accessURL and its size? We could specify that they be represented in order? I'm not a huge fan of that approach but I can't think of a better way to do it.
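One alternative to relying on order would be to nest each accessURL with its own size, so the link between the two is explicit. A rough Python sketch; the `distributions` key, the example URLs, and the byte counts are all hypothetical and not part of the current schema.

```python
# Hypothetical record: each access point carries its own size instead of two parallel lists.
dataset = {
    "title": "Example dataset",
    "distributions": [
        {"accessURL": "https://example.gov/data.csv", "size": 41943040},
        {"accessURL": "https://example.gov/data.zip", "size": 9437184},
    ],
}


def size_for(record, access_url):
    """Return the size recorded for one accessURL, or None if it is not listed."""
    for dist in record["distributions"]:
        if dist["accessURL"] == access_url:
            return dist["size"]
    return None
```

This costs a bit of nesting, but a consumer never has to guess which size belongs to which URL.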

Finally, I think this should be renamed either bytesize (if it is represented in bytes) or filesize (if represented in other units of measurement). The rationale for this is that size can be interpreted to mean a variety of things (e.g. size of geographic area covered, number of rows of data, etc.). Bytesize or filesize would clarify this.

mhogeweg commented 11 years ago

It's unclear to me what the purpose of the size field is. Especially when working with APIs and web services, 'size' depends on the specific request for (a subset of) the data and the format the data is returned in. Is it the size of a zipped file that's made available or the unzipped data? What is the size of the Landsat archive (millions of images collected over several decades) vs a picture of NDVI generated from this archive (on-the-fly as part of the web service request) for a small portion of the US?

seanherron commented 11 years ago

Good point - I thought about this myself. Government still distributes tons of data via raw file, probably way more often than via APIs or web services. In many circumstances, raw file access is probably the best way to do this, and accessURL (which size is linked to) is inclusive of direct download to raw files.

When accessURL is linked to a raw file, I would say that showing the size of the file is a good practice. If, as an extreme example, we linked to a gigantic zip file of the entire Landsat archive (but probably more realistically something like FDA SPL data, which is distributed in CSV), people should know the size of the file before they click on it. This matters in particular if someone on a mobile connection wants to quickly check out some tabular data but doesn't realize the file is actually 300,000 records in a 40 MB file or something.

As a side note, this is only useful if size changes as the file itself changes, which would necessitate either human intervention or server-side automation by agencies to update on a regular basis.
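For what it's worth, the server-side automation could be quite small. A sketch in Python, assuming the file sits on local disk and the metadata record is held as a dict; `refresh_size` and the record layout are hypothetical.

```python
import os


def refresh_size(record, path):
    """Overwrite the record's size with the file's current size in bytes."""
    record["size"] = os.path.getsize(path)
    return record
```

Something like this would need to run whenever the file is republished, for example as part of the agency's publishing job.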

mhogeweg commented 11 years ago

Just your last point would make me concerned about relying on the currency/accuracy of the size attribute whenever I see it. People just don't (manually) update this type of metadata. This is speaking from over 10 years working with Geospatial One-Stop and Data.gov in the US and various National Spatial Data Infrastructures globally.

On the web, whenever I click a link to download a file, my browser tells me how many bytes I'm about to download. That's directly associated with the actual file/stream/thing I'm about to download. Isn't that enough information for someone to decide to continue or not?

You describe a use case where someone is on a mobile device wanting to get some data. Do you know if there's an activity related to Data.gov to collect/define/design the various use cases? What IS the expected use of Data.gov in that sense? Are there apps (mobile/web/desktop/...) that people are building using datasets/services found at Data.gov that would then be used for the things you describe? Would those apps be findable at Data.gov?

seanherron commented 11 years ago

I agree with your point that this is not something we can reasonably expect that people will manually update, hence my point about it (hopefully) being automated.

I'm not aware of an activity for data.gov to collect use cases, though I believe http://next.data.gov/ is hoping to achieve that to some extent.

Hopefully one of the authors of the schema can chime in here on why they felt size should be included. I'm with you on a lot of your points, and I admit that mobile downloads of data is a pretty edge use case, and most other use cases I can think of (bandwidth-constrained, bandwidth-capped, or storage-constrained environments) would be negated by the fact that we don't really have a way of ensuring this value is correct in the first place.

MarionRoyal commented 11 years ago

The field SIZE was used in the standard Data.gov Metadata template in the manner that you have presumed and was probably just carried over into this schema. Originally, it was to provide the user an idea of the amount of resources needed (disk space, time, ...) before making a choice to download a block of data. I could probably argue that this is good to know before my browser informs me. It could be checked on a mobile app before taking some action. We (at data.gov) have never used "size" as a metric of our progress in achieving open data and I don't believe it is a valid metric going forward. Points well made on not being applicable to APIs and web services. So "size" probably rightfully deserves to carry on in the Required if Applicable section. However, it will be applicable to the vast majority of records.

With regard to changing the name of the field: As I age, I am becoming less concerned, or at least ambivalent, about the nouns chosen to express a concept (object) as long as the word is easily understood within a context (or namespace if you will) and mappable to others. I am confident that "size" in the context of this schema will not be confused with "dimension". Having said that, it would probably be an improvement to recognize DCAT:byteSize in future revisions. That is, of course, unless we invent a new noun to represent mass on a storage device.

seanherron commented 11 years ago

@MarionRoyal: Thanks for the background. With regard to converting size to recognize bytesize, I'm imagining that the schema ought to mandate that values be given in bytes rather than just allowing for byte values; otherwise we still have the issues I brought up in the original post, right?

MarionRoyal commented 11 years ago

@SeanHerron: If you are asking me if I think we also need a sizeUnit, I would say no. I think the existing field is a text field rather than a decimal field, which means that a valid entry could include the number of bytes (if less than a kilobyte) or could include a set of alphanumeric characters which would most likely include the letters K, M, G, T, P and could easily be grokked by an app (and maybe even a human). The problem with having a sizeUnit for this purpose is that it would suggest a need for a controlled vocabulary for this new field, which I think we are trying to avoid.

So, I would agree with changing the field name to byteSize (since it matches DCAT) and would have no objection to fileSize (since it is a recognized PHP term), but would leave sizeUnit to other more precise domains.
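To illustrate the "grokked by an app" point, here is a rough Python sketch of parsing a free-text size. The accepted formats, the regular expression, and the decimal multipliers are assumptions; real catalog entries may be messier.

```python
import re

# Assumed decimal multipliers for the letters K, M, G, T, P.
MULTIPLIERS = {"": 1, "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15}

# Matches things like "512", "40mb", "1.5 GB", "300 K".
SIZE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([KMGTP]?)B?\s*$", re.IGNORECASE)


def parse_size(text):
    """Return the size in bytes, or None when the text is not recognized."""
    match = SIZE_PATTERN.match(text)
    if match is None:
        return None
    number, letter = match.groups()
    return int(round(float(number) * MULTIPLIERS[letter.upper()]))
```

Anything the pattern does not recognize comes back as `None`, so an app can fall back to showing the raw text to a human.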

MarinaNitze commented 11 years ago

I like @MarionRoyal's idea to adopt DCAT:byteSize -- but flag that not everyone knows what a byte is, so we should link to some sort of basic calculator folks can use to convert from more-familiar KB/MB/GB.

This topic has come up a lot. Ultimately, the size field is not a deeply reliable measure if we are asking people to populate it by hand, because file size changes if so much as a punctuation mark is edited in the source file, and is largely meaningless when applied to APIs, as outlined above. I think those of us who are more technical appreciate this, but we could stand to be clearer to the less-technical folks that they should not be using this field for any sort of precise measurement or for compliance purposes.

Since it's not precise, I am less inclined to make it fully machine-readable. Separate size and sizeUnit fields seem like overkill, because if you're machine-reading you can probably also automatically calculate files' true sizes.

skybristol commented 11 years ago

I think the only thing that scales at the relatively crude level of discovery metadata currently being discussed is to do as @MarionRoyal suggests and leave it as a rough textual notification to downstream users. Best practice would be to include some type of units or explanation in the attribute so a human reading it might have a clue on what they are getting into. Otherwise, we'd need to look across various standards on how the magnitude of a given asset might be described and account for all the specifics.

gbinal commented 11 years ago

+1 for keeping this more textual and the use of letters K, M, G, T, P. I'm envisioning the spectrum of catalog creators and think that the low bar is appropriate here. I also don't think there'll be many use cases for machine-consumption of this field.

If so, wouldn't it then be best to stick with filesize so as to avoid the need for everyone to go to a filesize catalog each time?

seanherron commented 11 years ago

It seems like we're all in agreement that the field isn't particularly useful or relevant, so I'm going to go against my original idea and say we just leave it as is to avoid complication. Maybe in the future, if we look to pare down the schema, this would be a good field to deprecate.

jpmckinney commented 11 years ago

Is this a duplicate of #55?

seanherron commented 11 years ago

Yes, looks like it. I can close this and reference #55 if you'd like. I didn't come across it when I was posting.

jpmckinney commented 11 years ago

@seanherron I've only skimmed the discussion in this thread, but makes sense!

project-open-data / project-open-data.github.io

Additional Clarity on Size Field #101