glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

Curated dataset for Vijay #778

Closed by ReneRanzinger 9 months ago

ReneRanzinger commented 1 year ago
T1.1 Generate starting dataset
Due Date: 12/31/2023
Task owner: GW
Dependencies: None
Description: Set of PMIDs whose abstracts have been manually annotated, with the annotations extracted into a table
  Small groups (20-25 papers)
  May want to include "negative" examples as well
Deliverable: CSV table with annotated PMID abstracts, with columns:
  PMID
  Annotation type (disease, species, tissue, cell line)
  Concept ID (DOID, NCBI Taxon ID …)
  Abstract phrase (text from the abstract that triggered the annotation)
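
The deliverable described above could be written and checked with a short script. This is only a sketch: the column names and the validation rules are assumptions, since the ticket lists the four fields but not their exact headers. The example row reuses a PMID and mapping mentioned later in this thread.

```python
import csv
import io

# Hypothetical column names; the ticket only names the four fields.
FIELDS = ["pmid", "annotation_type", "concept_id", "abstract_phrase"]
ALLOWED_TYPES = {"disease", "species", "tissue", "cell line"}


def validate_row(row):
    """Return a list of problems found in one annotation row."""
    problems = []
    if not row["pmid"].isdigit():
        problems.append("pmid must be numeric")
    if row["annotation_type"] not in ALLOWED_TYPES:
        problems.append("unknown annotation_type: " + row["annotation_type"])
    if not row["abstract_phrase"].strip():
        problems.append("abstract_phrase is empty")
    return problems


def write_table(rows):
    """Serialize rows to CSV text, refusing any row that fails validation."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        problems = validate_row(row)
        if problems:
            raise ValueError("; ".join(problems))
        writer.writerow(row)
    return buffer.getvalue()


# Row taken from the species question later in this thread.
example = {
    "pmid": "38081891",
    "annotation_type": "species",
    "concept_id": "NCBI:txid10090",
    "abstract_phrase": "C57BL/6J mice",
}
csv_text = write_table([example])
print(csv_text)
```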

ReneRanzinger commented 1 year ago
  1. Does this mean 20-25 per concept (disease, species, tissue, cell line) or 20-25 total?
  2. Is 20-25 sufficient?
  3. How are negative examples handled?
ReneRanzinger commented 1 year ago

From Vijay:

  1. At least 20-25 per concept, but I don't mean 20-25 x 4 abstracts. That is, an abstract that mentions all four types counts as one for each concept.
  2. Ideally, more would be better, but 25 should give a reasonable picture of how these tools are working. We can go through a second iteration if we are not confident.
  3. Some papers that don't have any instance of a specific type.
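
The counting rule in Vijay's answers can be sketched as follows: each abstract counts once per concept type it mentions, and a paper is a negative example for any type it never mentions. The function name, the row layout, and the identifiers "A", "B", "C" are placeholders for illustration, not real PMIDs or an agreed format.

```python
from collections import defaultdict

CONCEPT_TYPES = ("disease", "species", "tissue", "cell line")


def coverage(rows, all_pmids):
    """Map each concept type to (positive PMIDs, negative PMIDs)."""
    positives = defaultdict(set)
    for pmid, annotation_type in rows:
        # A set ignores repeated annotations of one type in one abstract.
        positives[annotation_type].add(pmid)
    result = {}
    for concept in CONCEPT_TYPES:
        pos = positives[concept]
        result[concept] = (pos, set(all_pmids) - pos)
    return result


# Placeholder identifiers, not real PMIDs.
rows = [
    ("A", "disease"), ("A", "species"), ("A", "tissue"), ("A", "cell line"),
    ("B", "disease"), ("B", "disease"),
]
summary = coverage(rows, all_pmids={"A", "B", "C"})
for concept in CONCEPT_TYPES:
    pos, neg = summary[concept]
    print(concept, "positive:", len(pos), "negative:", len(neg))
```

Here abstract "A" counts toward all four concepts, "B" counts once for disease despite two annotations, and "C" is negative for every type.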
ReneRanzinger commented 1 year ago

Let @Shovan5795 know if you have questions or once you have a few datasets you want him and Vijay to review.

kmartinez834 commented 10 months ago

Updates from 12/14 meeting:

kmartinez834 commented 10 months ago

@Shovan5795 Can you confirm if you'll need the abstract pdf files as well?

Shovan5795 commented 10 months ago

@kmartinez834 if you can give us the PMID, that would be great. We can extract the abstracts from there and process them later.
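
One way to pull abstracts from a list of PMIDs is the NCBI E-utilities EFetch endpoint. The thread does not say which tool was actually used, so treat this as an illustrative sketch; the helper names `efetch_url` and `fetch_abstracts` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmids):
    """Build an EFetch URL that requests plain-text abstracts."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch_abstracts(pmids, timeout=30):
    """Download the abstracts as one text blob (requires network access)."""
    with urlopen(efetch_url(pmids), timeout=timeout) as response:
        return response.read().decode("utf-8")


# The two PMIDs mentioned elsewhere in this thread.
url = efetch_url([38081891, 38105611])
print(url)
```

Calling `fetch_abstracts` on small batches keeps requests within NCBI's usage guidelines.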

jeet-vora commented 10 months ago

@ubhuiyan will be working on this task to generate the CSV.

kmartinez834 commented 10 months ago

Email sent to Shovan and Vijay: 12/20/2023 11:02 AM

Thanks Shovan. We've put together an instruction document and excel sheet with examples for Urnisha, who will be doing most of the curation work. Could you and Vijay please take a look and let us know if you have any feedback or changes?

Task T1.1 - Generate starting dataset.docx

literature_mining_t1.1.xlsx

Also, we have a couple more questions:

  1. Negative examples: Do you need papers that are negative for all four concepts? Or, for example, could a paper be positive for disease but negative for tissue?
  2. Concept IDs: Do you need us to map each term to the corresponding DOID/UBERON/Taxonomy ID/Cellosaurus ID? If so, does it need to be exact or can it be a best match/synonym? If we can't find an appropriate UBERON, should we use other ontologies (e.g. BTO)?
  3. Species: Is it acceptable to map to the appropriate species/strain level based on context or domain knowledge? For example, PMID 38081891 mentions the strain "C57BL/6J mice" so I mapped to the species NCBI:txid10090 (Mus musculus). PMID 38105611 mentions "mouse" so I mapped to the genus NCBI:txid10088 (Mus). However, I assume the model wouldn't be able to predict which strains are associated with a species unless it's included as a synonym in your mapping file.

Thanks so much, Karina
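
The species question above comes down to whether a phrase appears in the synonym mapping. A minimal sketch, assuming a simple dictionary lookup (the real mapping file and matching logic are not described in this thread, and `map_species` is a hypothetical helper):

```python
# Hypothetical synonym table; entries taken from the two examples above.
SYNONYMS = {
    "mouse": "NCBI:txid10088",          # genus Mus
    "c57bl/6j mice": "NCBI:txid10090",  # Mus musculus
}


def map_species(phrase, synonyms=SYNONYMS):
    """Return the taxon ID for a phrase, or None if it is not a known synonym."""
    return synonyms.get(phrase.strip().lower())


print(map_species("C57BL/6J mice"))
print(map_species("some unlisted strain"))
```

A strain missing from the table yields no match, which is the gap the question points at.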

kmartinez834 commented 9 months ago

The curated and reference files for Tasks 1.1 and 1.2 are located in this SharePoint folder: Literature Mining

Direct link to the file: literature_mining_t1.1.xlsx