petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

📕 Documentation: Dictionary.xml and DictionaryDescription.md of: eoAnalysisInstrument (inactive) #15

Open petermr opened 4 years ago

petermr commented 4 years ago

Created a simple dictionary by hand (it can be extended later):

<dictionary title="instrument">
<desc>Hacked from a few papers PMR 20190904</desc>
<entry term="HP6890" name="HP6890"/>
<entry term="QP-5000" name="QP-5000"/>
<entry term="QP" name="QP"/>
<entry term="QP2010" name="QP2010"/>
<entry term="QP2010S" name="QP2010S"/>
<entry term="Shimadzu" name="Shimadzu"/>
<entry term="Clevenger" name="Clevenger"/>
</dictionary>

NOTE: term is the string used for searching (maybe with stemming); name is a descriptive label. The title attribute on dictionary must match the filename (here title="instrument" for instrument.xml).

NOTE: these terms are probably not in Wikidata. Also, Clevenger is not an instrument and should be removed.
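
A minimal sketch of reading such a dictionary and checking the title rule, using only the Python standard library. It assumes that "match filename" means the title equals the file name without its .xml extension; the helper names load_terms and title_matches_filename are hypothetical and are not part of ami. The sample is a shortened copy of the dictionary above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Shortened copy of the hand-made dictionary shown above.
DICTIONARY_XML = """<dictionary title="instrument">
<desc>Hacked from a few papers PMR 20190904</desc>
<entry term="HP6890" name="HP6890"/>
<entry term="QP-5000" name="QP-5000"/>
<entry term="Shimadzu" name="Shimadzu"/>
</dictionary>"""


def load_terms(xml_text):
    """Return (title, list of search terms) from dictionary XML text."""
    root = ET.fromstring(xml_text)
    terms = [entry.get("term") for entry in root.iter("entry")]
    return root.get("title"), terms


def title_matches_filename(title, path):
    """Assumption: the title must equal the file name minus its extension."""
    return Path(path).stem == title


title, terms = load_terms(DICTIONARY_XML)
print(title, terms)
print(title_matches_filename(title, "mydictionaries/instrument.xml"))
```

To check a file on disk, the text could be read with Path(path).read_text() and passed to load_terms together with the same path.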

petermr commented 4 years ago

Searching with dictionaries:

cd CEVOpen

Verify that oil186 is a subdirectory:

ls oil186

Then search:

ami-search -p oil186 --dictionary species country mydictionaries/instrument.xml

Here species is a builtin search, country is a builtin dictionary, and mydictionaries/instrument.xml is a path relative to the current directory.

Results are in PMC*/results/search/instrument/results.xml etc. and aggregated in

/some/where/.../CEVOpen/oil186/search.instrument.snippets.xml

as

<projectSnippetsTree>
 <snippetsTree>
  <snippets file="oil186/PMC4391421/results/search/instrument/results.xml">
   <result pre="Ph. Eur. 5.0 [ 3 ], by using a modified" exact="Clevenger" post="apparatus (with the EO collection area cooled to prevent"/>
   <result pre="chromatography-mass spectrometry. Samples were analyzed by gas chromatography using a" exact="HP6890" post="instrument coupled with a HP 5973 mass spectrometer. The"/>
  </snippets>
 </snippetsTree>
 <snippetsTree>
  <snippets file="oil186/PMC5080681/results/search/instrument/results.xml">
   <result pre="500 ml deionized water. Then, the flask was connected with" exact="Clevenger" post="apparatus, which was placed in the same apparatus. While"/>
   <result pre="the fresh weight. GC-MS analysis GC-MS chromatograms were recorded using" exact="Shimadzu" post="QP-5000 GC-MS. The GC was equipped with Rtx-5 ms"/>
   <result pre="fresh weight. GC-MS analysis GC-MS chromatograms were recorded using Shimadzu" exact="QP-5000" post="GC-MS. The GC was equipped with Rtx-5 ms column"/>
  </snippets>
 </snippetsTree>

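Each CTree (PMC document) is searched and its hits are collected into one snippetsTree. Each result element records the match in W3C Annotation style: pre (text before the match), exact (the matched term) and post (text after the match).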
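
A rough sketch of turning the aggregated snippets file into per-document term counts, again with the Python standard library only. It assumes the PMC ID can be taken from the file attribute of each snippets element; count_terms is a hypothetical helper, not something ami provides. The sample is an abbreviated copy of the excerpt above, with a closing projectSnippetsTree tag added so that it parses.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Abbreviated copy of the excerpt above, closed so that it is well-formed.
SNIPPETS_XML = """<projectSnippetsTree>
 <snippetsTree>
  <snippets file="oil186/PMC4391421/results/search/instrument/results.xml">
   <result pre="by using a modified" exact="Clevenger" post="apparatus"/>
   <result pre="gas chromatography using a" exact="HP6890" post="instrument coupled"/>
  </snippets>
 </snippetsTree>
 <snippetsTree>
  <snippets file="oil186/PMC5080681/results/search/instrument/results.xml">
   <result pre="connected with" exact="Clevenger" post="apparatus"/>
   <result pre="recorded using" exact="Shimadzu" post="QP-5000 GC-MS."/>
   <result pre="recorded using Shimadzu" exact="QP-5000" post="GC-MS."/>
  </snippets>
 </snippetsTree>
</projectSnippetsTree>"""


def count_terms(xml_text):
    """Map each PMC ID to a Counter of matched terms (the exact attribute)."""
    root = ET.fromstring(xml_text)
    counts = defaultdict(Counter)
    for snippets in root.iter("snippets"):
        # Assumption: the PMC ID appears in the path of the results file.
        match = re.search(r"PMC\d+", snippets.get("file", ""))
        if match is None:
            continue
        for result in snippets.iter("result"):
            counts[match.group(0)][result.get("exact")] += 1
    return counts


counts = count_terms(SNIPPETS_XML)
for pmcid in sorted(counts):
    for term, n in sorted(counts[pmcid].items()):
        print(pmcid, term, n, sep="\t")
```

For the real file, the XML text would be read from search.instrument.snippets.xml in the project directory instead of the inline sample.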

petermr commented 4 years ago

A simple grep that finds mass spec:

grep -r -E -o ".{0,50}mass spectrom.{0,50}" PMC*/scholarly.html

This searches all the HTML for "mass spectrom" and prints each match with up to 50 characters of context on either side.
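
For anyone who prefers to do the same from Python, here is a small sketch of a search with up to 50 characters of context on either side of "mass spectrom". The function name context_matches and the sample sentence are made up for illustration; as with grep, the context does not cross line breaks.

```python
import re


def context_matches(text, needle="mass spectrom", width=50):
    """Return each occurrence of needle with up to width characters of context
    on either side, staying within a single line."""
    pattern = re.compile(r".{0,%d}%s.{0,%d}" % (width, re.escape(needle), width))
    return [m.group(0) for m in pattern.finditer(text)]


# Made-up sample sentence, not taken from the corpus.
sample = "Samples were analysed by gas chromatography-mass spectrometry."
for hit in context_matches(sample, width=10):
    print(hit)
```

To run it over the corpus, each PMC*/scholarly.html file could be read (for example with pathlib.Path.glob) and its text passed to context_matches.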

lubianat commented 4 years ago

Hello,

I am working on how to migrate the article/instrument matches to Wikidata.

The xml with the excerpts is fantastic, but my xml processing skills are still incipient. I remember having seen in the sprint a summary table with the PMC IDs in one column and counts for each term in another column.

Would you know how I can obtain this summary file?

EDIT: Although I am still not able to generate the full HTML table, I could draft some code to migrate the data from the full table to Wikidata. The code is at https://github.com/caffiendFrog/elife2019/tree/master/wikidatamigration

One of the pages edited: https://www.wikidata.org/wiki/Q44476657

petermr commented 4 years ago

This is wonderful, Tiago. If you check out oil186/ you will find fulldatatables.html, which I think is what you want.

On Thu, 12 Sep 2019, 19:12 Tiago Lubiana, notifications@github.com wrote:

Hello,

I am working on how to migrate the article/instrument matches to Wikidata.

The xml with the excerpts is fantastic, but my xml processing skills are still incipient. I remember having seen in the sprint a summary table with the PMC IDs in one column and counts for each term in another column.

Would you know how I can obtain this summary file?

petermr commented 4 years ago

Tiago, can you send me your email (by email to peter.murray.rust AT gmail DOT com) so I can connect you with others.

Manny, meet Tiago, who is in Sao Paulo. Tiago was part of our eLife sprint and worked on the Instruments and on how to put this data into Wikidata, so his knowledge will be really valuable for missing Wikidata items. Tiago, Manny is in Brasilia and is pulling together the CEVOpen project management for extracting plants and their oils from the literature.

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK