neelkandlikar / water-sentiment


07 abstract count #15

Closed NeerajRattehalli closed 4 years ago

NeerajRattehalli commented 4 years ago

I worked on the abstract counts. The final output file should be in outputs\abstracts-per-year.png. However, the data seems to contain only a subset of years: [1963, 1976, 1977, 1988, 1989, 2000, 2014]. I looked specifically at the column "abstracts-retrieval-response.coredata.prism:coverDate". Am I looking in the wrong column, or is there somewhere else I need to get the data from? I also checked some of the data files manually, and they show the same thing.

richpauloo commented 4 years ago

@NeerajRattehalli - thanks for submitting this. The issue is either in the (1) cleaning code, (2) underlying data, or (3) interpretation of the abstracts-retrieval-response.coredata.prism:coverDate field.

  1. I find it hard to believe that the 65k abstracts contain only 7 unique years. @NeerajRattehalli, can you double check your work beginning with the raw abstracts and walking through your steps? Or plot as a line with time on x and count on y? The histogram may be doing some automatic binning. Also, is there no other "date" field that may be a publication date? Please check.

  2. @porefluid - can you verify that the scraping script is picking up abstracts from all years, and that @NeerajRattehalli is indeed looking at the correct column in the chunked abstracts? Here's the branch and notebook to consult, and your scraper.

  3. Lastly, @porefluid, in acquiring this data, did you happen upon metadata that may indicate what the prism:coverDate means? Perhaps it is not the "publication date" that we have in mind.
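To make the double check in point 1 concrete, here is a minimal sketch of counting abstracts per year straight from the raw prism:coverDate strings, without any histogram binning. It assumes the values are yyyy-mm-dd strings, as reported in this thread; the function name `count_abstracts_per_year` and the sample values are hypothetical and not taken from the project code or data.

```python
from collections import Counter
from datetime import datetime


def count_abstracts_per_year(cover_dates):
    """Count abstracts per year from yyyy-mm-dd strings.

    Returns (counts, unparsed): a Counter keyed by year, and the list of
    raw values that could not be parsed as dates.
    """
    counts = Counter()
    unparsed = []
    for raw in cover_dates:
        try:
            year = datetime.strptime(str(raw).strip(), "%Y-%m-%d").year
        except ValueError:
            # Keep anything that is not a valid date so it can be inspected.
            unparsed.append(raw)
            continue
        counts[year] += 1
    return counts, unparsed


# Hypothetical sample values, not taken from the real abstracts.
sample = ["1963-01-01", "1976-05-01", "1976-12-31", "2014-07-15", "", "n/a"]
counts, unparsed = count_abstracts_per_year(sample)

# Sorted (year, count) pairs can be plotted as a line: year on x, count on y.
series = sorted(counts.items())
print(series)
print(f"{len(counts)} unique years, {len(unparsed)} unparsed values")
```

If the number of unique years is still 7 when this runs over every chunk, the plot is not the problem and the issue is more likely in the underlying data or the query. A large unparsed count would instead point at the cleaning step.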

NeerajRattehalli commented 4 years ago

> @NeerajRattehalli - thanks for submitting this. The issue is either in the (1) cleaning code, (2) underlying data, or (3) interpretation of the abstracts-retrieval-response.coredata.prism:coverDate field.

I did the value counts and saw only 7 unique years, which was kind of surprising... I also checked directly in the abstract chunks and saw the same subset of dates (in yyyy-mm-dd format) repeated. I might need to look elsewhere, but no other header item clearly indicated a date. The date was occasionally mentioned in the abstract itself (it always matched the indicated year), and sometimes there was no date at all.

It definitely was alarming when I saw this. I haven't tried a line graph yet or a histogram.

richpauloo commented 4 years ago

According to SCOPUS, year is one of the fields returned by the query. If it's not there, or seems faulty, perhaps there's something wrong with the query. Let's see what @porefluid says.

richpauloo commented 4 years ago

Hey @neelkandlikar and @NeerajRattehalli I saw the USGS aquifer list -- thanks for uploading that!

How is the code coming along for iteratively matching names, labeling them, discovering new names, making a list of aquifer name variants, etc.?

We're still working on getting all the abstracts from all the years.