Closed: ahhurlbert closed this issue 9 years ago.
If you can describe what we want to extract I'd be happy to take a go at ripping it out.
Below is the format of the data. Numbered site with site name in all caps. Really we want to pull out everything between "Census:" and "Total:", as well as the list after "Visitors:". But note below that because the pdf is on the older side, there are many issues with spacing of words and letters. For example, below is the text "Pacific-slopFe lycatcher2", which should be "Pacific-slope Flycatcher 2". Hard to imagine resolving all these typos purely computationally, especially when letters from one word are getting inserted inside the previous word.
Ok, I think this is now doable if you have time to work on it. I just cleaned up the pdf using Acrobat's internal OCR, and then copying and pasting the text seems to work fine.
So here is a first take at what we want to pull out. Could be two files linked by Site Number and Year, or could be one big flat file, whatever is easiest:
Site Number
Location
Latitude
Longitude
Coverage_NumHrs
Coverage_NumVisits
Richness  # after "Total: " and before " species;"
Num_territories  # after " species; " and before "territories"
Then everything following "Census:" should be pulled out into the following fields:

Species  # species names listed in the "Census" section but also in the "Visitors:" section
Mean_abundance  # (following a ", ")
Max_abundance  # (inside parentheses)
Visitor  # 0 for all spp from the "Census" section, 1 for all spp from the "Visitors" section
(NB: Not all accounts have a Visitors section)
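A rough sketch of the extraction rules above (the regexes, the assumed text shapes, and the `parse_account` name are my guesses, not the actual script; real accounts will need the OCR noise cleaned up first):

```python
import re

def parse_account(text, year):
    """Sketch: pull the fields described above out of one site account.

    `text` is the plain text of a single account; `year` is the census
    year of the PDF being processed.
    """
    # Richness: number after "Total: " and before " species;"
    # Num_territories: number after " species; " and before "territories"
    total = re.search(r"Total:\s*(\d+)\s*species;\s*([\d.]+)\s*territories", text)
    richness = total.group(1) if total else None
    n_territories = total.group(2) if total else None

    records = []
    # Census/Visitors entries are assumed to look like "Species Name, 2.5 (3)":
    # mean abundance after ", ", max abundance inside parentheses.
    for section, visitor in (("Census", 0), ("Visitors", 1)):
        m = re.search(section + r":\s*(.*?)(?=Visitors:|Total:|$)", text, re.S)
        if m is None:
            continue  # not all accounts have a Visitors section
        for sp in re.finditer(r"([A-Z][^,;]+),\s*([\d.]+)(?:\s*\(([\d.]+)\))?",
                              m.group(1)):
            records.append({"Year": year,
                            "Species": sp.group(1).strip(),
                            "Mean_abundance": sp.group(2),
                            "Max_abundance": sp.group(3),
                            "Visitor": visitor,
                            "Richness": richness,
                            "Num_territories": n_territories})
    return records
```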
Last point is that each pdf file is for a different year, so presumably this is all being done within a year loop. Need to make sure you add a Year field to store that info too.
I put two sample pdfs in the data subrepo > raw_datasets > BBC_pdfs folder.
UPDATE: I worked on this all the way back from CA yesterday. I've made some decent progress including further improving the OCR using tesseract and starting to extract chunks from the text. Successfully parsing this automatically is definitely a bit tricky, but I'm making progress.
Current state of the code is here: https://github.com/ethanwhite/core-transient/blob/text-mining/bbc-text-mining.py
This is a pretty gnarly problem. I like it :smiling_imp:
Notes:
@ethanwhite Awesome that you are tackling this. Ethan, I introduce to you Molly @mollyfrn, who is interested in these BBC data. She is currently going to start tracking down more pdfs for censuses before 1988 and after 1996. Molly, see if you can compile the bolded title words Ethan is referring to. If there's anything you're unclear on, you can just ask in a comment here on the Github issues page!
hey @mollyfrn - nice to make your digital acquaintance. I look forward to meeting you in person when I'm up at UNC in the spring.
Hey! Looking forward to meeting you too! Thank you so much for developing this code. I'm really looking forward to working with these BBC data. :^)
Small tasks that would be helpful for me going forward:
- Size: values are always in ha?
- Census: do we want the, e.g., (48/40 ha)+ values?
- Location: is always the start of the latitude?
- Size: can you just store the units immediately following the value so we can confirm later?
Based on how I ended up handling this it looks like yes, it's only ha's (at least for the two years I've run) and the processing will throw an error if something else crops up so I'll see it.
Yes, we want the parenthetical in the total (e.g., 48/40 ha) as this gives us the ability to test/check our two other values that are read in (Size and # territories); i.e., these two values have simply been combined to yield the number of territories expected in 40 ha--HOLD ON, IN SOME YEARS THIS VALUE IS GIVEN AS, E.G., "(48/km2)", WHICH MEANS WE'LL HAVE TO STORE BOTH THE NUMERIC VALUE FOLLOWING THE "/" AS WELL AS THE UNITS (AND ASSIGN A NUMERIC VALUE OF 1 IF THERE IS NONE).
This looks like a pretty tricky problem since the presence of those parentheticals is not consistent, so if it's just a check on the other two numbers I wonder if it's worth the ~several hours it will take me to figure out how to solve it.
Ok, sounds good. We'll ignore it for now.
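For the record, if this check is ever revisited, a hedged sketch of parsing that parenthetical (assuming it only takes forms like "(48/40 ha)" and "(48/km2)"):

```python
import re

def parse_density(parenthetical):
    """Split e.g. "(48/40 ha)" or "(48/km2)" into (count, divisor, units).

    Per the rule above: store the numeric value after the "/" and the
    units, assigning a divisor of 1 when no number follows the slash.
    Returns None when the string doesn't match the assumed forms.
    """
    m = re.match(r"\((\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)?\s*([A-Za-z0-9]+)\)",
                 parenthetical)
    if not m:
        return None
    count = float(m.group(1))
    divisor = float(m.group(2)) if m.group(2) else 1.0
    return count, divisor, m.group(3)
```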
I am currently treating "territories" (post 1988) and "territorial males" (1988) in the Total section as the same thing. If that's not correct someone should let me know.
Yep, that's correct.
I just put in a PR with the text-mining code (#52) and pushed three tables of data to the data repo. Take a look when you get the chance and let me know where the errors are and any data that we need that is missing and I'll get it all cleaned up.
Spectacular and amazing! Thanks Ethan!
Some things I noticed (many of which may best be fixed by hand after the fact rather than programmatically):
Will keep looking, but can't find an instance of the same site recorded in multiple years to make sure siteID will actually link those records through time.
Looks like some characters are not making a smooth Unicode translation, e.g. some dashes as in "Black-and—white Warbler" and "3—29Â°C" (an A-hat getting inserted before the degree symbol), across all tables.
Fixed the dashes. Can you give me a table and row number for the "3—29°C" and the "’" so I can take a look? (With Python 3 and LibreOffice the unicode displays fine, so it's hard to see.)
Every row of bbs_censuses in the weather section reports temperature in degrees C, and has the problem of displaying "Â°C" instead of "°C".
As for "’", two examples are in the weather section, lines 150 and 381. Again, this character is replacing apostrophes.
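For what it's worth, the "Â°" artifact is classic UTF-8-bytes-read-as-cp1252 mojibake, while the curly quotes are legitimate Unicode that just needs normalizing to ASCII. A sketch of both fixes (the ftfy library automates the first step more robustly than this round trip):

```python
def clean_unicode(s):
    """Repair the two Unicode issues reported above.

    1. "Â°" for "°" is UTF-8 mis-decoded as cp1252; re-encoding with
       cp1252 recovers the original bytes, which then decode correctly.
       If the round trip fails, the text was probably fine already.
    2. Curly quotes and em/en dashes are normalized to ASCII.
    """
    try:
        s = s.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass  # not mojibake; leave the text as-is
    return (s.replace("\u2019", "'")    # right single quote -> apostrophe
             .replace("\u2014", "-")    # em dash -> hyphen
             .replace("\u2013", "-"))   # en dash -> hyphen
```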
... "Blackbur-nian", "Car-olina", "Blue-bird", "Eu-ropean", "Crest-ed", ...ok there's a lot. Probably to be fixed by hand...
I just pushed a valid_species.csv file to the data repo. At the moment it's just a list of all of the unique values in the species column. If you can go through this and clean it up so that it only has the one correct form of each species name in it, then I can work on automating the "-" cleanup. It's a bit messy at the moment due to some of the other things that need to be cleaned up, but at ~1000 values it seems like it should still be reasonably manageable.
Don't do the cleanup of valid_species just yet. No big deal if you already have, but I might be able to make this a lot easier so give me a few minutes to check it out.
OK, go ahead and take a look at valid_species.csv now. I shifted to being overly aggressive in getting rid of "-" rather than not getting rid of enough of them, so one of the big things you'll be checking for is missing dashes. Once we have this list I'll then fuzzy match each species name against it and replace it with the valid name in cases where a dash gets inappropriately removed.
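That fuzzy-matching step could be done with the standard library alone; a sketch (the 0.8 cutoff is a guess to tune against the real names):

```python
import difflib

def best_valid_name(raw_name, valid_names, cutoff=0.8):
    """Map an OCR'd species name to the closest entry in valid_species.

    Returns the single best match above `cutoff` similarity, or the raw
    name unchanged if nothing is close enough, so oddballs surface for
    manual review instead of being silently renamed.
    """
    matches = difflib.get_close_matches(raw_name, valid_names, n=1, cutoff=cutoff)
    return matches[0] if matches else raw_name
```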
There should be no species name with " and " in the middle of it, so this would be another string to split on besides commas. (Note there are species names with "-and-", e.g. Black-and-white Warbler.)
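Splitting on whitespace-delimited "and" is safe precisely because the "-and-" names have no spaces around the "and"; a sketch:

```python
import re

def split_species_list(s):
    """Split a species list on commas and on space-delimited "and".

    Hyphenated names like "Black-and-white Warbler" survive intact
    because the "and" in them has no surrounding whitespace.
    """
    return [part for part in re.split(r",\s*|\s+and\s+", s) if part]
```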
This one required a little creativity... and not for the reason I initially thought it would :)
New data pushed to repo and new PR submitted. These should fix everything on the list except probably the "Â°C" issue, because I can't successfully distinguish between "Â°C" and "°C" at the moment (i.e., my computer thinks that the symbol it's working with is "°" and not something else that would be causing issues).
Let me know what you find next. Once you've finished with valid_species.csv I'll work on fuzzy matching to fix the missing "-". Enjoy!
Not sure I understand what's been done. Have all of the files, including bbs_counts.csv, been updated with these corrections? I still see lots of "—" instead of "-", "’" instead of apostrophes, and "ï¬" for "fi". Also, "resident" has not been replaced with "breeder", despite the box checking above. So am I somehow missing updated files from the pull request, given that I have definitely updated my local repo by pulling from master?
Sorry, that's totally my fault. I pushed, posted, and went to help cook dinner and didn't realize that the push had been rejected. It should be all set now.
I've pushed a new file bbc_species_corrections.csv
which has fields for the original species name, the cleaned name (where it should be cleaned, otherwise it is blank), the Freq with which the original name appeared in the bbc_counts file, and a Notes field.
Notes field values:
NB: Most of the name cleaning (paring down from 739 names to 393) was not missing hyphens, but other misreadings of the PDF, e.g. "\N" or "VV" for "W", "/\/\" or "I\/I" or "i\/i" for "M", etc.
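Those systematic misreadings could be handled with a substitution table applied before any fuzzy matching; a sketch using only the pairs listed above (clearly not exhaustive):

```python
# Common OCR confusions seen in these PDFs, per the note above:
# "\N" and "VV" for "W"; "/\/\", "I\/I", "i\/i" for "M".
# Longer patterns come first so they win over shorter ones.
OCR_SUBSTITUTIONS = [
    ("/\\/\\", "M"),
    ("I\\/I", "M"),
    ("i\\/i", "M"),
    ("\\N", "W"),
    ("VV", "W"),
]

def fix_ocr_name(name):
    """Apply the substitution table to one species name."""
    for bad, good in OCR_SUBSTITUTIONS:
        name = name.replace(bad, good)
    return name
```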
OK, updated script is PR'd and new data tables are pushed to the data repo. Let me know what issues you find next :smile:.
So great. You're like a magician.
Next task: See if your wand works on these new earlier BBCs from the 1970s where the report text is split into 2 columns! (and hopefully the section names haven't changed....!)
Don't you understand that it's common courtesy to take at least two weeks to get back to me on something so that I don't have to worry about it for a while! I have to switch gears for the next 10 days since the Moore folks are coming into town for a site visit, but once that's over I'll get back on this.
The two-column format will definitely need to be handled a little differently. If everything looks clean in the current data let's go ahead and close this issue, which is getting kind of long and unwieldy, and open a new one for adding the new data.
The issue never closes!
Ok, you're right. And I obviously didn't expect you to dive back in immediately. At your convenience as always. But I get so excited seeing this progress!
@mollyfrn Why don't you post a new issue outlining the years that you've added that have the two column format, and any other details you've noticed that might be different from the original BBC censuses that Ethan has parsed.
Breeding Bird Census in North America counts the number of breeding territories of all species within fixed areas, and independently lists "Visitors", species that were not observed to hold a territory. There should be ~hundreds of sites, many of which are sampled repeatedly.
The problem is that the data are in paragraph form in PDF files and need to be extracted. See, e.g., https://sora.unm.edu/sites/default/files/journals/jfo/v062s01/p0027-p0088.pdf