Open dcldmartin opened 5 years ago
This particular institution took me weeks to parse, and I skipped a large chunk of the files because the data was incomplete. Take a look at the script to see the logic (for example, I skipped the files that have _All), and please open a PR if you are able to parse the missing files.

One important thing to keep in mind: GitHub limits file sizes to 100MB, and right now (even with the missing files) we already hit that limit with data-latest-2.tsv. Since the original two files are written based on specific indices and sizes with a subset of skipped files, updating the data isn't as simple as adding more parsers to the list, because the sizes would then be off. So if you write another parser, we would want to add logic that matches another set of patterns and writes to data-latest-3.tsv. The alternative is to redo the whole thing and measure the file size as you go, but I doubt you have the weeks or patience to do that :P
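For anyone who does want to try the "measure the file size as you go" route, here is a rough sketch of what that could look like. It is not the logic in the existing script: `write_rows_capped`, the `data-latest-N.tsv` naming loop, and the 95 MiB default (a safety margin under the limit) are all assumptions made for illustration.

```python
import os

# Assumed cap: a margin below GitHub's per-file limit (hypothetical value).
DEFAULT_LIMIT_BYTES = 95 * 1024 * 1024


def write_rows_capped(rows, out_dir=".", prefix="data-latest",
                      start_index=1, limit=DEFAULT_LIMIT_BYTES):
    """Write rows as TSV lines, rolling over to the next numbered file
    before the current one would exceed `limit` bytes.

    Returns the list of file paths that were written.
    Note: no header row is written or repeated; add one if you need it.
    """
    index = start_index
    written = 0          # bytes written to the current file
    handle = None
    paths = []
    try:
        for row in rows:
            line = ("\t".join(str(v) for v in row) + "\n").encode("utf-8")
            # Roll over only if the current file already holds data,
            # so a single oversized row never produces an empty file.
            need_new = handle is None or (written > 0 and written + len(line) > limit)
            if need_new:
                if handle is not None:
                    handle.close()
                    index += 1
                path = os.path.join(out_dir, f"{prefix}-{index}.tsv")
                handle = open(path, "wb")
                paths.append(path)
                written = 0
            handle.write(line)
            written += len(line)
    finally:
        if handle is not None:
            handle.close()
    return paths
```

Counting encoded bytes rather than characters keeps the estimate in line with what ends up on disk. Starting with `start_index=3` would leave the two existing files untouched and only spill new rows into data-latest-3.tsv and beyond.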
Interesting project! I'm looking at the OSHPD CA data. The records appear to list 354 distinct hospital_id values, but there are only 130 in the combined TSVs of the latest data (data-latest-1.tsv and data-latest-2.tsv).
I'll look into the parser and submit PRs if I find anything, but do you have any thoughts?
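In case it helps reproduce the count, this is roughly how the distinct IDs in the combined TSVs can be tallied. It is a minimal sketch that assumes each TSV has a header row containing a `hospital_id` column; the helper names `distinct_ids` and `missing_ids` are made up for this example.

```python
import csv


def distinct_ids(paths, column="hospital_id"):
    """Collect the distinct non-empty values of `column` across TSV files.

    Assumes every file has a header row that includes `column`.
    """
    ids = set()
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f, delimiter="\t")
            for row in reader:
                value = row.get(column)
                if value:
                    ids.add(value)
    return ids


def missing_ids(expected_ids, paths, column="hospital_id"):
    """Return the expected IDs that do not appear in any of the TSV files."""
    return set(expected_ids) - distinct_ids(paths, column)


# Example usage (file names taken from this thread):
# found = distinct_ids(["data-latest-1.tsv", "data-latest-2.tsv"])
# print(len(found))
```

Passing the full list of IDs from the records to `missing_ids` would show which hospitals are absent, which might point to the file patterns the parser currently skips.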