Closed: ahhurlbert closed this issue 9 years ago.
If you can describe what we want to extract I'd be happy to take a go at ripping it out.
Below is the format of the data. Numbered site with site name in all caps. Really we want to pull out everything between "Census:" and "Total:", as well as the list after "Visitors:". But note below that because the pdf is on the older side, there are many issues with spacing of words and letters. For example, below is the text "Pacific-slopFe lycatcher2", which should be "Pacific-slope Flycatcher 2". Hard to imagine resolving all these typos purely computationally, especially when letters from one word are getting inserted inside the previous word.
Ok, I think this is now doable if you have time to work on it. I just cleaned up the pdf using Acrobat's internal OCR, and then copying and pasting the text seems to work fine.
So here is a first take at what we want to pull out. Could be two files linked by Site Number and Year, or could be one big flat file, whatever is easiest:
Site Number
Location
Latitude
Longitude
Coverage_NumHrs
Coverage_NumVisits
Richness  # after "Total: " and before " species;"
Num_territories  # after " species; " and before "territories"
Then everything following "Census:" should be pulled out into the following fields:

Species  # species names listed in the "Census" section but also in the "Visitors:" section
Mean_abundance  # (following a ", ")
Max_abundance  # (inside parentheses)
Visitor  # 0 for all spp from the "Census" section, 1 for all spp from the "Visitors" section
(NB: Not all accounts have a Visitors section)
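A rough sketch of the extraction rules above (the regexes, the assumed text shapes, and the `parse_account` name are my guesses, not the actual script; real accounts will need the OCR noise cleaned up first):

```python
import re

def parse_account(text, year):
    """Sketch: pull the fields described above out of one site account.

    `text` is the plain text of a single account; `year` is the census
    year of the PDF being processed.
    """
    # Richness: number after "Total: " and before " species;"
    # Num_territories: number after " species; " and before "territories"
    total = re.search(r"Total:\s*(\d+)\s*species;\s*([\d.]+)\s*territories", text)
    richness = total.group(1) if total else None
    n_territories = total.group(2) if total else None

    records = []
    # Census/Visitors entries are assumed to look like "Species Name, 2.5 (3)":
    # mean abundance after ", ", max abundance inside parentheses.
    for section, visitor in (("Census", 0), ("Visitors", 1)):
        m = re.search(section + r":\s*(.*?)(?=Visitors:|Total:|$)", text, re.S)
        if m is None:
            continue  # not all accounts have a Visitors section
        for sp in re.finditer(r"([A-Z][^,;]+),\s*([\d.]+)(?:\s*\(([\d.]+)\))?",
                              m.group(1)):
            records.append({"Year": year,
                            "Species": sp.group(1).strip(),
                            "Mean_abundance": sp.group(2),
                            "Max_abundance": sp.group(3),
                            "Visitor": visitor,
                            "Richness": richness,
                            "Num_territories": n_territories})
    return records
```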
Last point is that each pdf file is for a different year, so presumably this is all being done within a year loop. Need to make sure you add a Year field to store that info too.
I put two sample pdfs in the data subrepo > raw_datasets > BBC_pdfs folder.
UPDATE: I worked on this all the way back from CA yesterday. I've made some decent progress including further improving the OCR using tesseract and starting to extract chunks from the text. Successfully parsing this automatically is definitely a bit tricky, but I'm making progress.
Current state of the code is here: https://github.com/ethanwhite/core-transient/blob/text-mining/bbc-text-mining.py
This is a pretty gnarly problem. I like it :smiling_imp:
Notes:
@ethanwhite Awesome that you are tackling this. Ethan, I introduce to you Molly @mollyfrn, who is interested in these BBC data. She is currently going to start tracking down more pdfs for censuses before 1988 and after 1996. Molly, see if you can compile the bolded title words Ethan is referring to. If there's anything you're unclear on, you can just ask in a comment here on the Github issues page!
hey @mollyfrn - nice to make your digital acquaintance. I look forward to meeting you in person when I'm up at UNC in the spring.
Hey! Looking forward to meeting you too! Thank you so much for developing this code. I'm really looking forward to working with these BBC data. :^)
Small tasks that would be helpful for me going forward:
- Size: values are always in ha?
- Census: do we want the, e.g., (48/40 ha)+ values?
- Location: is always the start of the latitude?
- Size: can you just store the units immediately following the value so we can confirm later?
Based on how I ended up handling this it looks like yes, it's only ha's (at least for the two years I've run) and the processing will throw an error if something else crops up so I'll see it.
Yes, we want the parenthetical in the total (e.g., 48/40 ha) as this gives us the ability to test/check our two other values that are read in (Size and # territories); i.e., these two values have simply been combined to yield the number of territories expected in 40 ha--HOLD ON, IN SOME YEARS THIS VALUE IS GIVEN AS, E.G., "(48/km2)", WHICH MEANS WE'LL HAVE TO STORE BOTH THE NUMERIC VALUE FOLLOWING THE "/" AS WELL AS THE UNITS (AND ASSIGN A NUMERIC VALUE OF 1 IF THERE IS NONE).
This looks like a pretty tricky problem since the presence of those parentheticals is not consistent, so if it's just a check on the other two numbers I wonder if it's worth the ~several hours it will take me to figure out how to solve it.
Ok, sounds good. We'll ignore it for now.
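For the record, if this check is ever revisited, a hedged sketch of parsing that parenthetical (assuming it only takes forms like "(48/40 ha)" and "(48/km2)"):

```python
import re

def parse_density(parenthetical):
    """Split e.g. "(48/40 ha)" or "(48/km2)" into (count, divisor, units).

    Per the rule above: store the numeric value after the "/" and the
    units, assigning a divisor of 1 when no number follows the slash.
    Returns None when the string doesn't match the assumed forms.
    """
    m = re.match(r"\((\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)?\s*([A-Za-z0-9]+)\)",
                 parenthetical)
    if not m:
        return None
    count = float(m.group(1))
    divisor = float(m.group(2)) if m.group(2) else 1.0
    return count, divisor, m.group(3)
```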
I am currently treating "territories" (post 1988) and "territorial males" (1988) in the Total section as the same thing. If that's not correct someone should let me know.
Yep, that's correct.
I just put in a PR with the text-mining code (#52) and pushed three tables of data to the data repo. Take a look when you get the chance and let me know where the errors are and any data that we need that is missing and I'll get it all cleaned up.
Spectacular and amazing! Thanks Ethan!
Some things I noticed (many of which may best be fixed by hand after the fact rather than programmatically):
Will keep looking, but can't find an instance of the same site recorded in multiple years to make sure siteID will actually link those records through time.
Looks like some characters are not making a smooth Unicode translation, e.g. some dashes as in "Black-and—white Warbler" and "3—29Â°C" (an A-hat getting inserted before the degree symbol), across all tables.
Fixed the dashes. Can you give me a table and row number for the "3—29°C" and the "’" so I can take a look? (With Python 3 and LibreOffice the unicode displays fine, so it's hard to see.)
Every row of bbs_censuses in the weather section reports temperature in degrees C, and has the problem of displaying "Â°C" instead of "°C".
As for "’", two examples are in the weather section, lines 150 and 381. Again, this character is replacing apostrophes.
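For what it's worth, the "Â°" artifact is classic UTF-8-bytes-read-as-cp1252 mojibake, while the curly quotes are legitimate Unicode that just needs normalizing to ASCII. A sketch of both fixes (the ftfy library automates the first step more robustly than this round trip):

```python
def clean_unicode(s):
    """Repair the two Unicode issues reported above.

    1. "Â°" for "°" is UTF-8 mis-decoded as cp1252; re-encoding with
       cp1252 recovers the original bytes, which then decode correctly.
       If the round trip fails, the text was probably fine already.
    2. Curly quotes and em/en dashes are normalized to ASCII.
    """
    try:
        s = s.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass  # not mojibake; leave the text as-is
    return (s.replace("\u2019", "'")    # right single quote -> apostrophe
             .replace("\u2014", "-")    # em dash -> hyphen
             .replace("\u2013", "-"))   # en dash -> hyphen
```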
... "Blackbur-nian", "Car-olina", "Blue-bird", "Eu-ropean", "Crest-ed", ...ok there's a lot. Probably to be fixed by hand...
I just pushed a valid_species.csv file to the data repo. At the moment it's just a list of all of the unique values in the species column. If you can go through this and clean it up so that it only has the one correct form of each species name in it, then I can work on automating the "-" cleanup. It's a bit messy at the moment due to some of the other things that need to be cleaned up, but at ~1000 values it seems like it should still be reasonably manageable.
Don't do the cleanup of valid_species just yet. No big deal if you already have, but I might be able to make this a lot easier so give me a few minutes to check it out.
OK, go ahead and take a look at valid_species.csv now. I shifted to being overly aggressive in getting rid of "-" rather than not getting rid of enough of them, so one of the big things you'll be checking for is missing dashes. Once we have this list I'll then fuzzy match each species name against it and replace it with the valid name in cases where a dash gets inappropriately removed.
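That fuzzy-matching step could be done with the standard library alone; a sketch (the 0.8 cutoff is a guess to tune against the real names):

```python
import difflib

def best_valid_name(raw_name, valid_names, cutoff=0.8):
    """Map an OCR'd species name to the closest entry in valid_species.

    Returns the single best match above `cutoff` similarity, or the raw
    name unchanged if nothing is close enough, so oddballs surface for
    manual review instead of being silently renamed.
    """
    matches = difflib.get_close_matches(raw_name, valid_names, n=1, cutoff=cutoff)
    return matches[0] if matches else raw_name
```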
There should be no species name with " and " in the middle of it, so this would be another string to split on besides commas. (Note there are species names with "-and-", e.g. Black-and-white Warbler.)
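Splitting on whitespace-delimited "and" is safe precisely because the "-and-" names have no spaces around the "and"; a sketch:

```python
import re

def split_species_list(s):
    """Split a species list on commas and on space-delimited "and".

    Hyphenated names like "Black-and-white Warbler" survive intact
    because the "and" in them has no surrounding whitespace.
    """
    return [part for part in re.split(r",\s*|\s+and\s+", s) if part]
```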
This one required a little creativity... and not for the reason I initially thought it would :)
New data pushed to repo and new PR submitted. These should fix everything on the list except probably the "Â°C" issue, because I can't successfully distinguish between "Â°C" and "°C" at the moment (i.e., my computer thinks that the symbol it's working with is "°" and not something else that would be causing issues).
Let me know what you find next. Once you've finished with valid_species.csv I'll work on fuzzy matching to fix the missing "-". Enjoy!
Not sure I understand what's been done. Have all of the files, including bbs_counts.csv, been updated with these corrections? I still see lots of "—" instead of "-", "’" instead of apostrophes, and "ï¬" for "fi". Also, "resident" has not been replaced with "breeder", despite the box checking above. So am I somehow missing updated files from the pull request, given that I have definitely updated my local repo by pulling from master?
Sorry, that's totally my fault. I pushed, posted, and went to help cook dinner and didn't realize that the push had been rejected. It should be all set now.
I've pushed a new file bbc_species_corrections.csv
which has fields for the original species name, the cleaned name (where it should be cleaned, otherwise it is blank), the Freq with which the original name appeared in the bbc_counts file, and a Notes field.
Notes field values:
NB: Most of the name cleaning (paring down from 739 names to 393) was not missing hyphens, but other misreadings of the PDF, e.g. "\N" or "VV" for "W", "/\/\" or "I\/I" or "i\/i" for "M", etc.
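Those systematic misreadings could be handled with a substitution table applied before any fuzzy matching; a sketch using only the pairs listed above (clearly not exhaustive):

```python
# Common OCR confusions seen in these PDFs, per the note above:
# "\N" and "VV" for "W"; "/\/\", "I\/I", "i\/i" for "M".
# Longer patterns come first so they win over shorter ones.
OCR_SUBSTITUTIONS = [
    ("/\\/\\", "M"),
    ("I\\/I", "M"),
    ("i\\/i", "M"),
    ("\\N", "W"),
    ("VV", "W"),
]

def fix_ocr_name(name):
    """Apply the substitution table to one species name."""
    for bad, good in OCR_SUBSTITUTIONS:
        name = name.replace(bad, good)
    return name
```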
OK, updated script is PR'd and new data tables are pushed to the data repo. Let me know what issues you find next :smile:.
So great. You're like a magician.
Next task: See if your wand works on these new earlier BBCs from the 1970s where the report text is split into 2 columns! (and hopefully the section names haven't changed....!)
Don't you understand that it's common courtesy to take at least two weeks to get back to me on something so that I don't have to worry about it for a while! I have to switch gears for the next 10 days since the Moore folks are coming into town for a site visit, but once that's over I'll get back on this.
The two-column format will definitely need to be handled a little differently. If everything looks clean in the current data let's go ahead and close this issue, which is getting kind of long and unwieldy, and open a new one for adding the new data.
The issue never closes!
Ok, you're right. And I obviously didn't expect you to dive back in immediately. At your convenience as always. But I get so excited seeing this progress!
@mollyfrn Why don't you post a new issue outlining the years that you've added that have the two column format, and any other details you've noticed that might be different from the original BBC censuses that Ethan has parsed.
Breeding Bird Census in North America counts the number of breeding territories of all species within fixed areas, and independently lists "Visitors", species that were not observed to hold a territory. There should be ~hundreds of sites, many of which are sampled repeatedly.
The problem is that the data are in paragraph form in PDF files and need to be extracted. See, e.g., https://sora.unm.edu/sites/default/files/journals/jfo/v062s01/p0027-p0088.pdf