petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

Analyze target organism species in articles #70

Closed petermr closed 4 years ago

petermr commented 4 years ago

Which organisms are the targets of the activity of essential oils (EOs)?

petermr commented 4 years ago

extract target organism names

Many of the measured activities are against micro-organisms (bacteria, viruses, fungi) and, by extension, arthropods (insects) and parasites (helminths/worms, etc.). This will be broad and might include herbicidal activities (but not laboratory animal strains).

 identify sections or tables in which target organisms occur

These are likely to include the words "anti-X" where X is an organism (-bacterial, -fungal, etc.)

sections and tables

section

table

create TSV file with above columns

after (say) 30-50 articles request review

create a table with target organisms, frequency, and Wikidata IDs

See compound table for design
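
A minimal sketch of how section or table titles containing "anti-X" terms might be flagged before manual review. The vocabulary of X terms and the function names are assumptions for illustration, not part of the CEVOpen tooling; a short fixed list is used so that words such as "antioxidant" are not picked up.

```python
import re

# Assumed vocabulary of organism-related "anti-X" suffixes; extend as needed.
ANTI_TERM = re.compile(
    r"\banti[-\s]?(bacterial|fungal|microbial|viral|parasitic|malarial)\b",
    re.IGNORECASE,
)

def find_anti_terms(title):
    """Return the lower-cased X part of every 'anti-X' term found in a title."""
    return [m.group(1).lower() for m in ANTI_TERM.finditer(title)]

def candidate_titles(titles):
    """Keep only the titles that mention at least one 'anti-X' term."""
    return [t for t in titles if find_anti_terms(t)]

if __name__ == "__main__":
    titles = [
        "3.2. Antibacterial activity",
        "Table 2: Anti-fungal and antiviral assays",
        "2.1. Plant material",
        "Antioxidant capacity",
    ]
    for title in candidate_titles(titles):
        print(title, find_anti_terms(title))
```

Titles that pass this filter would still need to be read by hand, since the match only says that an activity is named, not which organisms were tested.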

ambarishK commented 4 years ago

Sir, please review the target organism extraction sheet - targetOrganismSpecies20191218.tsv

It contains the analysis of the first 20 articles of oil186.

petermr commented 4 years ago

Thank you. Please make sure there is a separate row for each table (as for compounds) and for sections.

On Tue, Dec 17, 2019 at 10:53 PM Ambarish Kumar notifications@github.com wrote:

Sir, please review the target organism extraction sheet - targetOrganismSpecies20191218.tsv https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/targetOrganismSpecies20191218.tsv

There is analysis of first 20 articles of oil186.


-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

ambarishK commented 4 years ago

OK sir.

petermr commented 4 years ago

I have analyzed the intermediate commit (with ca 100 rows) to extract the species. See https://github.com/petermr/CEVOpen/edit/master/articleAnalysis/oil186/raw/targetOrganismCount.csv @mannyrules for comment

@ambarishK please look up the species in Wikidata and add a column for the ID
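
For reference, a hedged sketch of how such a lookup could be prepared programmatically instead of by hand. It assumes the Wikidata Query Service endpoint and the taxon name property (P225); the function names are hypothetical, and the code only builds the query and URL without sending any request.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def taxon_query(name):
    """Build a SPARQL query for items whose taxon name (P225) equals `name`."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'SELECT ?item WHERE {{ ?item wdt:P225 "{escaped}" . }} LIMIT 5'

def query_url(name):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {"query": taxon_query(name), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)

def qid_from_uri(uri):
    """Extract the trailing Q-identifier from an entity URI."""
    return uri.rsplit("/", 1)[-1]

if __name__ == "__main__":
    print(taxon_query("Candida albicans"))
    print(query_url("Candida albicans"))
```

An exact match on the taxon name returns nothing for misspelt or abbreviated names, so entries such as "C. albicans" would need to be expanded first.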

ambarishK commented 4 years ago

OK sir.

ambarishK commented 4 years ago

Sir, I have added a column for the Wikidata ID: targetOrganismCount.csv

Also, the remaining target organisms are included, from row 101 onwards (to 180): targetOrganismSpecies20191218.tsv

The next step would be dictionary making.

petermr commented 4 years ago

Thank you. You can remove the two entries with missing Wikidata IDs, "micro-organisms" and "Robrardoterolla".

P.

On Fri, Dec 20, 2019 at 5:38 AM Ambarish Kumar notifications@github.com wrote:

Sir, I have added column for WIKIDATA ID. targetOrganismCount.csv https://github.com/petermr/CEVOpen/edit/master/articleAnalysis/oil186/raw/targetOrganismCount.csv

Also, include remaining target organisms. Row number 101 and onwards (to 180).


ambarishK commented 4 years ago

Sir, I have added the remaining records of extracted target organisms (40 new records).
The updates are in a copy of the target organism extraction sheet: targetOrganismCountCopy.csv

The target organism dictionary file is targetOrganism20191222.xml

Please review these files.

EmanuelFaria commented 4 years ago

I have analyzed the intermediate commit (with ca 100 rows) to extract the species. See https://github.com/petermr/CEVOpen/edit/master/articleAnalysis/oil186/raw/targetOrganismCount.csv @mannyrules for comment

Gentlemen, it seems my notifications weren't working, so I missed this.

@ambarishK bring me up to date regarding the following, please:

  1. Are you still pulling target species out of Oil186, or are you done with that now?
  2. How did you do it (are you doing it)?... manual copy and paste? GREP? other?
  3. If still pulling species, what articles are done? Which remain?
  4. I'm travelling for 7 days (starting tomorrow, Jan/3/2020)... can I do anything (specifically) to help, or have you got this handled without me?

Sorry for the mixup.

Manny

ambarishK commented 4 years ago

Hi Manny.

  1. Target organism extraction from oil186 is complete.

  2. Extraction is done manually.

  3. All articles are covered. I will revisit the extraction sheet, as it contains 180 records while there are 186 articles. I have to verify whether any article was left out.

  4. https://github.com/petermr/CEVOpen/edit/master/articleAnalysis/oil186/raw/targetOrganismCount.csv - this sheet is the extraction of target organisms from oil1000. It includes the occurrence frequency, which was calculated by PMR, and I have added the WD ID for the target organisms.

  5. I will drop you a message once I have verified that every article of oil186 is covered for target organism extraction.

I will be available after 4 PM IST.

Also, we have to get together on extracting other entities like techniques, activities etc.

ambarishK commented 4 years ago

After revising the target organism extraction sheet to check coverage of all 186 articles, I found the following missing articles.

PMC5307902 - No activity is discussed in the article.
PMC5524814 - No activity is discussed in the article.
PMC5597067 - Activity is discussed and target organisms are extracted from the section.
PMC5602841 - No activity against microorganisms is discussed as such.
PMC5694875 - No activity is discussed.
PMC5789316 - Activity is discussed and target organisms are extracted from the section.
PMC5858457 - Activity is discussed and target organisms are extracted from the section.

I will add those records and update the target organism extraction sheet.

Confirmation from PMR is required for the update.

ambarishK commented 4 years ago

Sir, please review the target organism extraction sheet - oil1000TargetOrganismSpecies.tsv.

Please suggest a format or template for the WD ID column for target organisms, as there are multiple entries in each cell of the micro-organism column.
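
One possible layout, sketched under assumptions: split each multi-entry cell so that every organism gets its own row, which leaves room for exactly one WD ID per row. The two-column input (article ID, comma-separated organisms) and the function name are hypothetical, not the actual layout of the sheet.

```python
import csv
import io

def explode_rows(tsv_text):
    """Turn (article, 'org1, org2, ...') rows into one (article, organism, wd_id) row each.

    The wd_id field is left empty, to be filled in after lookup.
    """
    rows = []
    for record in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(record) < 2:
            continue  # no organism column on this line
        article = record[0].strip()
        for name in record[1].split(","):
            name = name.strip()
            if name:
                rows.append((article, name, ""))
    return rows

if __name__ == "__main__":
    sample = (
        "PMC5807769\tCandida albicans\n"
        "PMC5849928\tSalmonella typhimurium, Staphylococcus aureus, Escherichia coli\n"
        "PMC5750605\t\n"
    )
    for row in explode_rows(sample):
        print("\t".join(row))
```

Articles with an empty organism cell produce no rows in this sketch; whether they should be kept as placeholder rows is a design choice left open here.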

petermr commented 4 years ago

On Sat, Jan 4, 2020 at 6:16 PM Ambarish Kumar notifications@github.com wrote:

Sir, please review the target organism extraction sheet - oil1000TargetOrganismSpecies.tsv https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/oil1000TargetOrganismSpecies.tsv .

Please continue this to oil1000

Also update the table https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/targetOrganismCount.csv and add any new species

Please suggest for adding WD ID column for target organisms (format or template for WD ID column) as there are multiple entries into each cell of micro-organism column.


ambarishK commented 4 years ago

OK sir.

petermr commented 4 years ago

Concentrate first on finding all organisms from oil1000 and add them to https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/targetOrganismCount.csv


ambarishK commented 4 years ago

OK sir.

ambarishK commented 4 years ago

Sir, I have extracted target organisms from 600 articles of oil1000. At the end of the day, I will complete extraction from 800 articles.

Compilation and processing of the extraction sheet will require 3 passes: expanding abbreviations to full names, frequency counting, and normalization.
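
A rough sketch of how the three passes might be automated, under stated assumptions: each cell is a comma-separated list of names, and an abbreviated name such as "C. albicans" is expanded only when exactly one full name elsewhere in the data has a genus starting with that abbreviation and the same remainder. All function names are hypothetical.

```python
import re
from collections import Counter

# One capital letter plus up to two lower-case letters, then a dot: "C.", "Ae."
ABBREV = re.compile(r"^([A-Z][a-z]{0,2})\.\s*(.+)$")

def normalise(name):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(name.split())

def full_names(names):
    """Names of two or more words whose first word is not abbreviated."""
    return {n for n in names if not ABBREV.match(n) and len(n.split()) >= 2}

def expand(name, known):
    """Expand an abbreviated genus if exactly one known full name fits."""
    m = ABBREV.match(name)
    if not m:
        return name
    prefix, rest = m.group(1), m.group(2)
    hits = {k for k in known
            if k.split()[0].startswith(prefix) and k.split(None, 1)[1] == rest}
    return hits.pop() if len(hits) == 1 else name  # ambiguous or unknown: keep as is

def count_organisms(cells):
    """Normalise, expand, and count every organism mention across all cells."""
    names = [normalise(n) for cell in cells for n in cell.split(",") if n.strip()]
    known = full_names(names)
    return Counter(expand(n, known) for n in names)

if __name__ == "__main__":
    cells = [
        "C. albicans, C. neoformans, A. niger",
        "Candida albicans,  Cryptococcus neoformans, Staphylococcus aureus",
        "S. aureus, Candida albicans",
    ]
    for name, freq in count_organisms(cells).most_common():
        print(freq, name)
```

Matching on the whole binomial rather than the genus initial alone matters here, because "C." can stand for either Candida or Cryptococcus. Names that cannot be resolved are left untouched for manual review, and the count is per mention, so a name repeated within one cell is counted more than once.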

I will upload the extraction sheet and add those records after completion.

Examples of extracted records are as follows.

PMC5750594    C. albicans, C. neoformans, A. niger. C. neoformans
PMC5750605
PMC5750654
PMC5751248
PMC5761127
PMC5772139
PMC5778200
PMC5778779
PMC5788217    Bacillus cereus, Listeria monocytogenes, Micrococcus flavus, Staphylococcus aureus, Dickeya solani, Escherichia coli, Pectobacterium atrosepticum, Pectobacterium carotovorum subsp. carotovorum, Pseudomonas aeruginosa, Aspergillus flavus, A. ochraceus, A. niger, Candida albicans, Penicillium funiculosum, P. ochrochloron
PMC5789270
PMC5789316
PMC5794096
PMC5795983    Candida krusei, Candida albicans, Candida guilliermondii, Candida parapsilosis, Candida orthopsilosis, Candida metapsilosis, Cryptococcus neoformans, Paracoccidioides brasiliensis, Trichophyton mentagrophytes, Staphylococcus aureus, Escherichia coli, Pseudomonas aeruginosa
PMC5797122    Aedes aegypti, Anopheles quadrimaculatus, Anopheles albimanus
PMC5806308    Staphylococcus aureus, Bacillus subtilis, Pseudomonas aeruginosa, Candida albicans
PMC5807769    Candida albicans
PMC5811758
PMC5813356
PMC5822514
PMC5830750    Enterococcus faecalis, Staphylococcus aureus, Staphylococcus epidermidis, Proteus mirabilis, Escherichia coli, Pseudomonas aeruginosa
PMC5838999
PMC5842484
PMC5846372    Staphylococcus aureus, Staphylococcus epidermidis, Streptococcus mutans, Streptococcus viridans, Escherichia coli, Enterobacter cloacae, Klebsiella pneumoniae, Pseudomonas aeruginosa, Candida albicans, C. tropicalis, C. glabrata
PMC5848570    Staphylococcus aureus, Enterococcus feacalis, Klebsiella pneumoniae, Salmonella paratyphi
PMC5849894
PMC5849899    A. flavus
PMC5849928    Salmonella typhimurium, Staphylococcus aureus, Escherichia coli
PMC5852288
PMC5855832
PMC5858069    Salmonella typhimurium, B. subtilis, S. epidermidis, S. mutans, C. albicans, Actinobacillus actinomycetemcomitans, E. faecalis, Serratia marcescens, S. aureus, M. luteus
PMC5858457    An. stephensi, Ae. aegypti, Cx. quinquefasciatus
PMC5859817    Staphylococcus aureus ATCC 6538 and Pseudomonas aeruginosa
PMC5867545    S. aureus, P. mirabilis, Streptococci spp., P. aeruginosa, E. coli, Salmonella, Klebsiella spp.
PMC5867556

ambarishK commented 4 years ago

Sir, please go through the extraction sheet for target organisms from oil1000 - oil1000TargetOrganismUnprocessed.csv

It is in an unprocessed state right now.

The required processing steps are as follows.

  • Full name of abbreviation.
  • Adding WDID.
  • Frequency count
  • Normalization.

I am working through all of the above steps.

petermr commented 4 years ago

Thank you

On Sat, 11 Jan 2020, 10:29 Ambarish Kumar, notifications@github.com wrote:

Sir, please go through the extraction sheet for target organisms from oil1000 - oil1000TargetOrganismUnprocessed.csv https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/oil1000TargetOrganismUnprocessed.csv

It is in unprocessed state right now.

Required processing steps are as follows.

  • Full name of abbreviation.
  • Adding WDID.
  • Frequency count
  • Normalization.


ambarishK commented 4 years ago

Sir, please go through the extraction sheet for target organisms from oil1000 - oil1000TargetOrganismsUnique.tsv.

There are 593 unique records.

Please add the frequency count of each target organism in oil1000.

There are some issues; for instance, many records mention only a genus name. For example -

Acetobacter

Achromobacter

Acidobacteria

Dictyoglomi

Firmicutes

lectularius

Should I remove the records that have only a genus name?

I am adding WIKIDATA ID for target organisms.
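
As an alternative to deleting such records, single-word entries could be tagged for review. The sketch below is only an illustration with hypothetical labels and function names; it relies on the convention that genus and higher-taxon names are capitalised while a bare species epithet such as "lectularius" is written in lower case.

```python
def rank_hint(name):
    """Give a coarse hint about what kind of name a record holds."""
    words = name.split()
    if len(words) >= 2:
        return "binomial-or-longer"
    if len(words) == 1 and words[0][0].isupper():
        return "single-capitalised"  # genus, or possibly a higher taxon
    return "needs-review"            # lower-case single word or empty cell

def annotate(names):
    """Pair every name with its hint so the sheet can carry an extra column."""
    return [(n, rank_hint(n)) for n in names]

if __name__ == "__main__":
    for name, hint in annotate(["Acetobacter", "Firmicutes", "lectularius", "Candida albicans"]):
        print(f"{name}\t{hint}")
```

A capitalised single word is not necessarily a genus (Firmicutes, for example, is a phylum), so the hint column is a prompt for manual checking rather than a taxonomic rank.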

petermr commented 4 years ago

Thanks. Keep the genus name.

P.

On Mon, Jan 13, 2020 at 11:21 AM Ambarish Kumar notifications@github.com wrote:

Sir, please go through the extraction sheet for target organisms from oil1000 - oil1000TargetOrganismsUnique.tsv https://github.com/petermr/CEVOpen/blob/master/articleAnalysis/oil186/raw/oil1000TargetOrganismsUnique.tsv .



ambarishK commented 4 years ago

OK sir.