Closed — konklone closed this 10 years ago
All of these things are working for me. I think that to make this useful, we might want to accept a date range as well.
:+1: to a date range, sounds extremely useful.
OK, I have the new features working. Eric and Dan, could you look it over when you get a chance? And Dan, could you also review the new README?
If you guys are happy, I will close up the outstanding issues.
There is still an issue with downloading some of the older zip files. I don't think that is on our end. I reached out to GPO to see why we can't open them.
Thanks!
This is cool! I'm a little confused about how some stuff works now, though. If the default behavior is to download the files from GPO, I'd sort of like to have a little control over where those downloads go. This is tough given the myriad input/output combinations that are possible now. It seems like parsed files go exactly where I tell them to with `-od`, but no matter what, the precursor stuff ends up in `source` relative to the parser code. So, here's what I might propose:

- If a flag (`--extract-to`?) is passed, the zip is extracted to the path specified (starting at `CREC-201x-...`), and then only the zip file in the tmp folder is cleaned up.
- Rename the date argument to `dates`, and make it accept any number of args, so `./parse.py 2014-01-21 2014-01-22 2014-01-23` is a valid invocation.

This is a lot, feel free to ping me offline to discuss.
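A minimal sketch of the CLI shape proposed above, using `argparse`. The names (`dates`, `--extract-to`, the `source` default) come from this comment's suggestion, not from the parser's actual interface:

```python
# Hypothetical sketch of the proposed CLI; flag names are assumptions.
import argparse

parser = argparse.ArgumentParser(description="Parse Congressional Record days")
# nargs="+" lets the positional argument accept one or more dates.
parser.add_argument("dates", nargs="+",
                    help="one or more days, e.g. 2014-01-21 2014-01-22")
parser.add_argument("--extract-to", default="source",
                    help="directory where CREC zips are extracted")

# Simulate: ./parse.py 2014-01-21 2014-01-22 --extract-to /tmp/crec
args = parser.parse_args(["2014-01-21", "2014-01-22",
                          "--extract-to", "/tmp/crec"])
print(args.dates)        # ['2014-01-21', '2014-01-22']
print(args.extract_to)   # '/tmp/crec'
```

With `nargs="+"`, any number of day arguments becomes valid in a single invocation, which covers the date-range use case without a separate range syntax.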
I don't think it is a good idea to get rid of the text files, because the parser doesn't parse all the files, and people may want to look at some of the files we are not currently parsing. I think we should treat them as a type of result. I also like having the previous step handy if the XML doesn't look right. Perhaps we should just get rid of the .htm files and put the .txt files in a parallel folder?

I agree that the "source files," which are really results, should be in the same place by default, with the option for the user to specify a different place.
I don't have any strong feelings here, but the way I've been managing some of our other parsers is to default the paths for where downloaded/input files go (`cache`) and output files go (`data`) at the project's top level, and then gitignore them.
The unitedstates/congress project allows the dirs to be overridden in a config file, but you don't even need a config file to run the scraper.
But either way, to control where they actually go, I just use symlinks. So the scraper code doesn't have to worry much about paths, and deployment-specific concerns become deployment-specific work.
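The convention described here (a gitignored `cache`/`data` pair at the project's top level, redirected per deployment with symlinks) could be sketched like this; the directory names are from this comment, and nothing below is the project's actual code:

```python
# Sketch of the default-directory convention: downloads in ./cache,
# parsed output in ./data, both gitignored. The names are assumptions.
from pathlib import Path

CACHE_DIR = Path("cache")  # downloaded/input files
DATA_DIR = Path("data")    # parsed output files

for d in (CACHE_DIR, DATA_DIR):
    # exist_ok=True also tolerates a symlink that points at a real
    # directory, so e.g. `ln -s /mnt/storage/crec cache` just works
    # and the scraper code never has to think about deployment paths.
    d.mkdir(parents=True, exist_ok=True)
```

Because `mkdir(exist_ok=True)` accepts a pre-existing symlinked directory, the code stays path-agnostic and the symlink carries the deployment-specific concern.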
I think I have all the new features working; tell me if you find any additional bugs.
Also, Dan, I think that Open Congress should use the original scraper, since it is more sophisticated and it can point the parser to files. There are flags to delete the documents and deliver the xml wherever you like.
I am going to take another pass at documentation tomorrow.
@LindsayYoung, I think you can close this, unless I'm missing something.
I think we are good.
This is a couple of things at once. It should be possible to run the parser as:

- If that day doesn't exist on disk, the day's file should be downloaded and unzipped, and then parsed.
- If the day does exist on disk already, use that instead of re-downloading the file.

But do allow the user to force a redownload, perhaps using a `--force` flag. So:

- `infile` as a flag instead
- a `--force` flag to force a download whether or not it's on disk (and delete anything currently on disk first)