unitedstates / congressional-record

A parser for the Congressional Record.

Remove --findfileforme, use date as first argument, use downloaded cache #8

Closed: konklone closed this issue 10 years ago

konklone commented 10 years ago

This is a couple of things at once. It should be possible to run the parser as:

./parser.py 2014-01-21

If that day doesn't exist on disk, the day's file should be downloaded and unzipped, and then parsed.

If the day does exist on disk already, use that instead of re-downloading the file.

But do allow the user to force a re-download, perhaps via a --force flag.
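A minimal sketch of that flow, with an assumed cache directory name and an assumed GPO URL pattern (none of this is the parser's actual code):

```python
import os
import sys
import urllib.request
import zipfile

CACHE_DIR = "source"  # assumed download location

def fetch_day(day, force=False):
    """Download and unzip one day's Record unless it is already cached."""
    day_dir = os.path.join(CACHE_DIR, day)
    if os.path.isdir(day_dir) and not force:
        return day_dir  # cached: reuse what is already on disk
    os.makedirs(CACHE_DIR, exist_ok=True)
    url = "https://www.gpo.gov/fdsys/pkg/CREC-%s.zip" % day  # assumed URL pattern
    zip_path = day_dir + ".zip"
    urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as z:
        z.extractall(day_dir)
    return day_dir

if __name__ == "__main__":
    day = sys.argv[1]                   # e.g. 2014-01-21
    force = "--force" in sys.argv[2:]
    print(fetch_day(day, force=force))  # the parser would then read this directory
```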

LindsayYoung commented 10 years ago

All of these things are working for me. I think that to make this useful, we might want to accept a date range as well.

konklone commented 10 years ago

:+1: to a date range; that sounds extremely useful.
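A date range could be as simple as a start and an end argument expanded into single days. A sketch (the actual argument style had not been settled at this point):

```python
from datetime import date, timedelta

def days_in_range(start, end):
    """Yield ISO date strings from start to end, inclusive."""
    d = date.fromisoformat(start)
    stop = date.fromisoformat(end)
    while d <= stop:
        yield d.isoformat()
        d += timedelta(days=1)

# Each day would then be fetched and parsed exactly like a single-date run.
for day in days_in_range("2014-01-21", "2014-01-24"):
    print(day)
```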

LindsayYoung commented 10 years ago

OK, I have the new features working. Eric and Dan, could you look it over when you get a chance? And Dan, could you also review the new README?

If you guys are happy, I will close up the outstanding issues.

There is still an issue with downloading some of the older zip files. I don't think that is on our end. I reached out to GPO to see why we can't open them.

Thanks!

drinks commented 10 years ago

This is cool! I'm a little confused about how some of this works now, though. If the default behavior is to download the files from GPO, I'd sort of like to have a little control over where those downloads go. This is tough given the myriad input/output combinations that are possible now. It seems like parsed files go exactly where I tell them to with -od, but no matter what, the precursor files end up in source relative to the parser code. So, here's what I might propose:

This is a lot; feel free to ping me offline to discuss.
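For a sense of what that control might look like, here is a hypothetical argparse sketch. Only -od appears in the comment above; --indir, the defaults, and everything else are invented for illustration and are not the proposal from this thread:

```python
import argparse

parser = argparse.ArgumentParser(description="Congressional Record parser")
parser.add_argument("day", help="date to parse, e.g. 2014-01-21")
parser.add_argument("-od", "--outdir", default="output",
                    help="where parsed XML is written")
parser.add_argument("--indir", default="source",
                    help="where downloaded/unzipped source files land")
args = parser.parse_args()
print(args.day, args.indir, args.outdir)
```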

LindsayYoung commented 10 years ago

I don't think it is a good idea to get rid of the text files, because the parser doesn't parse all the files and people may want to look at the ones we are not currently parsing. I think we should treat them as a type of result. I also like having the previous step handy if the XML doesn't look right. Perhaps we should just get rid of the .htm files and put the .txt files in a parallel folder?

I agree that the "source files" (which are really results) should be in the same place by default, with the user able to specify a different place.

konklone commented 10 years ago

I don't have any strong feelings here, but the way I've been managing some of our other parsers is to default the paths for downloaded/input files (cache) and output files (data) to the project's top level, and then gitignore them.
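A rough sketch of that layout (the directory names are from the comment; nothing here is the project's actual code, and cache/ and data/ would both be listed in .gitignore):

```python
import os

# cache/ holds downloads, data/ holds parser output,
# both at the project top level.
PROJECT_ROOT = os.path.dirname(os.path.abspath(__file__))
CACHE_DIR = os.path.join(PROJECT_ROOT, "cache")
DATA_DIR = os.path.join(PROJECT_ROOT, "data")

os.makedirs(CACHE_DIR, exist_ok=True)
os.makedirs(DATA_DIR, exist_ok=True)
```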

The unitedstates/congress project allows the directory to be overridden in a config file, but you don't even need a config file to run the scraper.

But either way, to control where they actually go, I just use symlinks. So the scraper code doesn't have to worry much about paths, and deployment-specific concerns become deployment-specific work.
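For example (the target path here is made up for illustration):

```python
import os

# The scraper keeps writing to "cache"; a symlink redirects it to the
# real location, so the code never needs to know about deployment paths.
if not os.path.exists("cache"):
    os.symlink("/mnt/storage/cr-cache", "cache")
```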

LindsayYoung commented 10 years ago

I think I have all the new features working. Tell me if you find any additional bugs.

Also, Dan, I think that Open Congress should use the original scraper, since it is more sophisticated and can point the parser at files. There are flags to delete the documents and deliver the XML wherever you like.

I am going to take another pass at documentation tomorrow.

konklone commented 10 years ago

@LindsayYoung, I think you can close this, unless I'm missing something.

LindsayYoung commented 10 years ago

I think we are good.