idigbio-api-hackathon / Species-interaction-data-extraction

place for data and code for mining species interactions from specimen data

Data #1

Open seltmann opened 9 years ago

seltmann commented 9 years ago

@jhpoelen @mjcollin I added a small dataset of parasitoid data here. I can provide messier data as well, but thought this might be a good start. The aec:associatedFloraNotes and aec:associatedFaunaNotes fields hold the verbatim text from the labels.

jhpoelen commented 9 years ago

thanks @seltmann . . . hoping to put it to use (at least as an example) in the next couple of days. Do you happen to have a method paper of sorts that describes your text extraction process?

seltmann commented 9 years ago

@jhpoelen no, I don't have a methods paper, although the general idea I tried was the same as in the PLOS ONE paper "Utilizing Descriptive Statements from BHL". It's semi-manual: proposed structures are offered to a person to review, and somehow the script "learns" from the decisions the person makes. I did this in a very rudimentary way by creating dictionaries of meaningful words and stop word lists.

I first went through the notes and created a dictionary of words/definitions for the association relationships, based on the dataset I am using (all note sets are somewhat different depending on taxa). That was done by randomly picking notes and having a person define which part of the note was meaningful, along with its definition (or mapping to an ontology term). "Ex." is Latin for "from", and I think that in the majority of cases, particularly with parasitoids, it can be assumed to map to "emerged_from". The cool thing is that there is actually a great deal of repetition, so the dictionary needs very few entries, but the challenge then comes with all of the variations on each entry. "Ex." could be entered as "Ex ", "Ex.", "ex", or "ex.", for example. If the note is only a scientific name, the default is the generic "associates_with".
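A minimal sketch of that relationship dictionary, not the original script. Only the "ex" to "emerged_from" mapping and the "associates_with" default come from the description above; the names `RELATION_TERMS`, `normalize_token`, and `relation_for` are made up for illustration, and scanning every whitespace-separated word is an assumption.

```python
# Hypothetical relationship dictionary: normalized label word -> relation term.
# Only this one entry is taken from the thread; a real dictionary would be
# built by a person reviewing randomly picked notes.
RELATION_TERMS = {"ex": "emerged_from"}
DEFAULT_RELATION = "associates_with"


def normalize_token(token):
    """Collapse variants such as "Ex ", "Ex.", "ex" and "ex." to one key."""
    return token.strip().rstrip(".").lower()


def relation_for(note):
    """Return the relation for the first dictionary word found in the note.

    Falls back to the generic default when no word matches, e.g. when the
    note is only a scientific name.
    """
    for word in note.split():
        term = RELATION_TERMS.get(normalize_token(word))
        if term is not None:
            return term
    return DEFAULT_RELATION
```

Normalizing before the lookup keeps the dictionary small: one entry covers all four spellings of "Ex." listed above.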

A second dictionary was created based on insect scientific names (host names) and plant scientific names. I used taxon names we had in our database, but I suspect a name service could help with this, although I did not try to incorporate one.
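A rough sketch of how such a name dictionary could be matched against a note. The two entries are placeholder taxa chosen for illustration, not names from the dataset, and `TAXON_NAMES` and `find_taxa` are hypothetical; the real list came from our database.

```python
import re

# Hypothetical name dictionary: scientific name -> kind of organism.
# These two entries are illustrative placeholders only.
TAXON_NAMES = {
    "Malacosoma americanum": "insect",
    "Quercus alba": "plant",
}


def find_taxa(note):
    """Return (name, kind) pairs for known names, in order of appearance."""
    found = []
    for name, kind in TAXON_NAMES.items():
        # Whole-word, case-insensitive match of the full name.
        pattern = r"\b" + re.escape(name) + r"\b"
        match = re.search(pattern, note, flags=re.IGNORECASE)
        if match:
            found.append((match.start(), name, kind))
    return [(name, kind) for _, name, kind in sorted(found)]
```

Exact matching like this misses misspellings and abbreviated genus names, which is where a name service might help.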

I hard-coded some observations I have about how people write host information on labels. For example, if a person uses "ex.", the words that follow are most often the name of the host, and for this dataset that host is an insect, based on its biology. It would be very interesting to have the script learn label structure as well, but I never went that far.
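As an illustration of that kind of hard-coded rule (a sketch, not the script I used): the function name `extract_host`, the `max_words=2` cutoff (assuming the host is written as a genus plus species), and the returned field names are all assumptions.

```python
def extract_host(note, max_words=2):
    """Apply the rule "the words after ex. name the host" to one note.

    Returns a dict describing the interaction, or None when the note has
    no "ex" marker followed by at least one word.
    """
    words = note.split()
    for i, word in enumerate(words):
        # Accept "Ex", "Ex.", "ex" and "ex." as the marker.
        if word.rstrip(".").lower() == "ex":
            host_words = words[i + 1 : i + 1 + max_words]
            if host_words:
                # Drop trailing punctuation left over from the label text.
                host = " ".join(w.strip(".,;") for w in host_words)
                return {
                    "relation": "emerged_from",
                    "host": host,
                    # Dataset-specific assumption: hosts here are insects.
                    "host_kind": "insect",
                }
    return None
```

For a note such as "Ex. Malacosoma americanum, 12 May", this takes the two words after the marker as the host name and ignores the rest.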

debpaul commented 9 years ago

Hi Katja,

Wish you were here at the hackathon! See the list of pitches we're working on right now.

https://docs.google.com/document/d/1ushqk5r5llQVVEcYNhOehiVIVHZG51vbG-Yau1L1oHY/edit?usp=sharing

:-) Deb

On 6/3/2015 9:11 AM, Katja Seltmann wrote:

-- Upcoming iDigBio Events https://www.idigbio.org/calendar -- Deborah Paul, iDigBio Technology Specialist Institute for Digital Information, 234 LSB Florida State University Tallahassee, Florida 32306 850-644-6366