Closed NeerajRattehalli closed 4 years ago
@NeerajRattehalli - thanks for submitting this. The issue is either in the (1) cleaning code, (2) underlying data, or (3) interpretation of the abstracts-retrieval-response.coredata.prism:coverDate field.
I find it hard to believe that the 65k abstracts contain only 7 unique years. @NeerajRattehalli, can you double check your work beginning with the raw abstracts and walking through your steps? Or plot as a line with time on x and count on y? The histogram may be doing some automatic binning. Also, is there no other "date" field that may be a publication date? Please check.
@porefluid - can you verify that the scraping script is picking up abstracts from all years, and that @NeerajRattehalli is indeed looking at the correct column in the chunked abstracts? Here's the branch and notebook to consult, and your scraper.
Lastly, @porefluid, in acquiring this data, did you happen upon metadata that may indicate what the prism:coverDate field means? Perhaps it is not the "publication date" that we have in mind.
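One way to rule out histogram binning is to count the records per year directly and print (or line-plot) the exact counts. The following is only a minimal sketch using the standard library: the column name comes from this thread, but the sample records and the helper name count_by_year are made up for illustration, and it assumes each date is a yyyy-mm-dd string.

```python
from collections import Counter

# Column name taken from the thread; the sample records below are invented.
DATE_COL = "abstracts-retrieval-response.coredata.prism:coverDate"


def count_by_year(records, date_col=DATE_COL):
    """Count records per year, reading the year from a yyyy-mm-dd string."""
    counts = Counter()
    for rec in records:
        value = rec.get(date_col)
        if not value:
            # Track missing dates explicitly instead of silently dropping them.
            counts[None] += 1
            continue
        # Assumes the first four characters are the year; a malformed
        # value will raise ValueError, which is useful to see here.
        counts[int(str(value)[:4])] += 1
    return counts


sample = [
    {DATE_COL: "1963-01-01"},
    {DATE_COL: "1976-06-15"},
    {DATE_COL: "1976-12-01"},
    {DATE_COL: None},
]

year_counts = count_by_year(sample)
for year in sorted(y for y in year_counts if y is not None):
    print(year, year_counts[year])
print("missing:", year_counts[None])
```

Running the same count on the raw abstracts and again on the cleaned chunks would show whether the missing years disappear during cleaning or were never scraped in the first place.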
I did the value counts and only saw 7 unique years, which was kind of surprising... I also checked directly in the abstract chunks and saw the same subset of dates (in yyyy-mm-dd format) repeated. I might need to look elsewhere, but no other header item clearly indicated a date reference. I saw the date mentioned occasionally in the abstract text (it always matched the indicated year), and sometimes there was no date at all.
It definitely was alarming when I saw this. I haven't tried a line graph yet or a histogram.
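To check systematically for another date-like header, the nested record can be flattened and its key paths scanned for "date" or "year". This is a rough sketch, not a description of the actual chunk layout: the sample record is invented, and the key hypothetical-sort-year is a placeholder rather than a real SCOPUS field.

```python
def flatten_keys(obj, prefix=""):
    """Yield (dotted_key, value) pairs for every leaf in a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten_keys(value, path)
        else:
            yield path, value


def candidate_date_fields(record, hints=("date", "year")):
    """Return the sorted key paths whose names suggest a date or year."""
    return sorted(
        path
        for path, _ in flatten_keys(record)
        if any(hint in path.lower() for hint in hints)
    )


# Invented record; "hypothetical-sort-year" is a placeholder key.
sample = {
    "abstracts-retrieval-response": {
        "coredata": {
            "prism:coverDate": "1976-06-15",
            "dc:title": "Example title",
        },
        "item": {"hypothetical-sort-year": "1976"},
    }
}

fields = candidate_date_fields(sample)
for path in fields:
    print(path)
```

If the chunks are already flattened into columns, the same substring check could be applied to the column names instead.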
According to SCOPUS, the year is one of the fields returned by the query. If it's not there, or seems faulty, perhaps there's something wrong with the query. Let's see what @porefluid says.
Hey @neelkandlikar and @NeerajRattehalli I saw the USGS aquifer list -- thanks for uploading that!
How is the code coming along for iteratively matching names, labeling them, discovering new names and making a list of aquifer name variants, etc?
We're still working on getting all the abstracts from all the years.
I worked on the abstract counts. The final output file should be in outputs\abstracts-per-year.png. The data, however, seems to contain only a subset of the years: [1963, 1976, 1977, 1988, 1989, 2000, 2014]. I looked specifically at the following column: "abstracts-retrieval-response.coredata.prism:coverDate". Am I looking in the wrong column? Or is there somewhere else I need to get the data from? I checked some of the data files manually as well, and they show the same subset of years.