project-open-data / csv-to-api

Proof of concept to dynamically generate RESTful APIs from static CSVs
http://labs.data.gov/csv-to-api/
326 stars 81 forks source link

top 5 things I hope somebody can tweak #17

Closed wdoerner closed 10 years ago

wdoerner commented 10 years ago

Great, simple API! I've been trying it out and noticed a few things that could be improved (I'm not a programmer so I can only contribute on the user end). My top 5 things that I hope somebody can tweak or address:

1) There is a file size limit. I know it won't work with a source file with over 90,000 rows. Not sure about less. 2) The sort is limited to a single variable. Can we add two or more? A good example of use is sorting over year and month, which isn't possible and can't be corrected by an end user. 3) The export filename doesn't have an extension. One of the computers I use is set to download certain files automatically and it downloads csv-to-api without a default .json or anything else (if the format variable is different). 4) The source file is restricted to a CSV. Can we do a TXT also that is in CSV format? I have some files that are saved identical to a CSV but have a TXT or XLS extension and cannot be read in. 5) Is there a way to query as http://www.agencyname.gov/csv-to-api?source=file.csv? Right now, the first part starts with http://labs.data.gov. This might be my lack of understanding but I copied the csv-to-api.php file over to our server and it doesn't work with agencyname.gov. There isn't any mention of this on the "Source Code and Documentation" link and it would be helpful to understand for folks who are starting out.

Thanks! ~ Will

waldoj commented 10 years ago

Great list, Will! In order:

  1. I suspect that the file size limit is a function of your own machine's resource limitations. I'll play with it and see.
  2. I'm not sure. I mean in theory, absolutely, but I'm not sure that it can be done easily within the existing framework, but I'll take a closer look at it.
  3. Well, that's clearly a bug. :) Thanks!
  4. Huh, interesting. Filtering against that extension was an easy to make sure that the content is viable, without having to actually download the data and examine it. I'm inclined to just expand the list of viable extensions to include .txt (not sure about .xls—CSV saved with that extension is really just a bug).
  5. Absolutely! As long as http://www.agencyname.gov/csv-to-api/ works (that is, it's at a valid URL, you don't get a 404), then you should be able to pass a URL like http://www.agencyname.gov/csv-to-api/?source=http://www.agencyname.gov/path/to/file.csv. Note that source does need to be a complete URL, so that the program will know where to find it. I'll figure out how to make the documentation more clear on this point.

I'll expand this into a series of issues now, to handle them each individually.

waldoj commented 10 years ago

The export filename doesn't have an extension.

Wait. I just thought this through. And now I'm confused. An API doesn't provide filenames, it just returns JSON. There's no file being transmitted, only data as a part of an HTTP response. I think everything is working properly here, you're just doing something with it that isn't normally done with APIs. :)

wdoerner commented 10 years ago

Thanks for the quick replies.

  1. I don't think CSVs have an actual file size limit but there might be a timeout issue like you alluded.
  2. One of these discussions might help to sort by multiple values in an array: http://us1.php.net/manual/en/function.usort.php#105764 http://us1.php.net/manual/en/function.usort.php#111462
  3. I think this was because of IE. We're still running IE 8.0 (my only browser and not by choice). I don't have any issues in Firefox or Chrome (I get the JSON format returned).
  4. Expanding to include TXT would be awesome!
  5. Every time I make that GET request with http://wwww.agencyname.gov/csv-to-api/..., it doesn't recognize the API and just downloads the source file. I imagine I might need to create a real folder where the two files sit for csv-to-api instead of a redirect path.

For a 6th request, I know it's bad practice with the csv-to-api format to not have the first row be variable names but, well, it happens. The problem is that the first row is automatically assigned to be the variable name keys. Could an argument be added like "novarnames=true" (default being false so it is optional)? When it's true, a quick foreach loop could assign names like var1-varx or col1-colx if the array is of size x. That would still allow to sort and filter over column names for files that lack variable names (using var1, var2, etc.). The obvious fix is to have standardized files but files are sometimes posted without variable names and this would give the end user more control.

Thanks again for the help!

wdoerner commented 10 years ago

I have a lucky number 7th request and then I'll let things rest as I try out more stuff. I don't want to abuse the help and time of others on here but think this would be a useful addition when a CSV has many (unwanted) fields.

Can a "keep" or "drop" function be added? I think the PHP term is "array_splice". It would need to support multiple variables like "keep=[var1,var2,var3]" or "drop=[var4,var5]". I write the code jointly because if it were "keep=var1&keep=var2" I am afraid that the first command might be executed and leave us without var2 (or vice versa).

http://stackoverflow.com/questions/16564650/best-way-to-delete-column-from-multidimensional-array http://stackoverflow.com/questions/3702573/remove-columns-from-the-subarrays-of-a-two-dimensional-array

waldoj commented 10 years ago
Every time I make that GET request with http://wwww.agencyname.gov/csv-to-api/..., it doesn't recognize the API and just downloads the source file.

Interesting. I suspect that this is two problems in play, but I don't know for sure. I speculate that your web server isn't figured to recognize JSON as a file type, and is just treating it as generic binary data, delivering it to the browser to figure out. And then your browser, I further speculate, has no capacity to figure out that it's JSON (a reasonable guess for IE 8), and it also just treats it as generic binary data. Do you happen to know what web server and version of that web server that your agency's website is running? That would help to get to the bottom of this. We might be able to force the use of some more descriptive headers or something, to work around this problem.

I know it's bad practice with the csv-to-api format to not have the first row be variable names but, well, it happens. The problem is that the first row is automatically assigned to be the variable name keys.

Hmm. Interesting problem. Your solution seems wholly reasonable. I've created #20 to address this.

I don't want to abuse the help and time of others on here

The great thing about the process here on GitHub is that there's no such thing as that. :) That is, any ideas for modifications that you or anybody else have can be posted here, and when somebody gets to them, great! It's my experience that it's best to make such suggestions for features or fixing bugs as soon as one encounters them—otherwise, we have a way of forgetting the change that we'd envisioned, or becoming accustomed to the bugs. So keep it up! :)

Can a "keep" or "drop" function be added?

Could you describe the functionality to me? Is the idea that keep=name would drop every column except name? And that drop=address would drop the address column, but keep all of the others? Do I have that right?

wdoerner commented 10 years ago

Thanks for replying to the other comments and creating the other issues to address them.

Do you happen to know what web server and version of that web server...?

Unfortunately, no. The GET works fine in Firefox and Chrome when I use the labs.data.gov in the URL. It appears that IE 8.0 is the problem here because it wants to download an XML file or displays the JSON in a single line. I think this error is with either the earlier version of the browser or the default behavior for a file type. Either way, I cannot fix that problem because of IT restrictions.

The other issue that I think I can address is when I replace the labs.data.gov part with www.agencyname.gov and it downloads the source file instead of executing the format change, filter, or sort. I believe I might be able to create a folder on the server (again, don't have permissions so I have to wait on it) because an alias did not work. The last option is a redirect that I will try out. If none of those work, then I will add another comment in here. I feel like it is at least useful to write down these steps in case others see this post later and to organize myself on the approach.

Could you describe the functionality to me? Is the idea that keep=name would drop every column except name? And that drop=address would drop the address column, but keep all of the others? Do I have that right?

Correct. The idea is to reduce the array size or only show certain columns/fields. In statistical software I am used to, the keep and drop commands are the common syntax that have the same goal but approach the task differently. A series of drop commands would do the same thing as a single keep.

There are times when you might have many columns and keep=name would avoid displaying all of them. Likewise, drop=address could tighten up the display of a dataset where address was not desired (if there are many fields, it's just quicker to do a drop than to type in all the fields for a keep). A flexible function would be one that reads in multiple elements (maybe enclosing with square brackets or quotes) if a user wants to keep or drop more than a single field, like keep=[address,city,state]. Another approach would be a joint command like keep=address&keep=city but that performs them both at the same time instead of executing either keep=address or keep=city (I've found that two sort commands would execute only one or the other but not both so I make the distinction here for clarity).

waldoj commented 10 years ago

I think this error is with either the earlier version of the browser or the default behavior for a file type.

Interesting! Well, the good news is that it probably doesn't matter. By which I mean that the purpose of CSV to API is for software to be able to access data via APIs, rather than web browsers. :) So while this is surely frustrating when trying to debug the data output in IE 8, it looks like everything's working OK (except IE 8, of course!). Still, I'm going to check on the output, to see if we can force a more specific header that might allow this to work better.

Re: keep and drop, that makes sense to me. I've just opened #21 to that end. Thanks for the explanation!