petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

create instrument dictionary #29

Closed petermr closed 4 years ago

petermr commented 4 years ago

create a list of instruments used in analysing (but NOT extracting) Essential Oils. This can be used as ground truth for Tiago's extraction sub-project.

These should be found in the "Materials and methods" section:

  <span class="bold">Gas chromatography-mass spectrometry</span>. Samples were analyzed by gas chromatography using a HP6890 instrument coupled with a HP 5973 mass spectrometer. The gas chromatograph is equipped with a split-splitless injector and a Factor FourTM VF-35ms 5% fenil-methylpolysiloxane, 30 m, 0.25 mm, 0.25 μm film thickness capillary column. Gas chromatography conditions include a temperature range of 50 to 250°C at 40°C/min, with a solvent delay of 5 min. The injector was maintained at a temperature of 250°C. The inert gas was helium at a flow of 1.0 mL/min, and the injected volume in the splitless mode was 1 μL. The MS conditions were the following: ionization energy, 70 eV; quadrupole temperature, 100°C; scanning velocity, 1.6 scan/s; weight range, 40-500 amu.

Create a new column for GC-MS. For now, just extract "HP6890" (GC) and "HP 5973" (MS).
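A minimal sketch of this extraction step, assuming a hand-written regex is good enough for a first pass. The pattern, the vendor alternation, and the function name `find_models` are illustrative assumptions, not part of the CEVOpen code.

```python
import re

# Hypothetical pattern: a vendor prefix followed by a 4-digit model number,
# with optional whitespace in between (e.g. "HP6890", "HP 5973").
# The vendor list is an assumption and would need extending.
MODEL_RE = re.compile(r"\b(HP|Agilent)\s?(\d{4}[A-Z]?)\b")


def find_models(text):
    """Return (vendor, model) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in MODEL_RE.finditer(text)]


snippet = (
    "Samples were analyzed by gas chromatography using a HP6890 "
    "instrument coupled with a HP 5973 mass spectrometer."
)
print(find_models(snippet))
```

Deciding which hit is the GC and which is the MS still needs the surrounding words ("gas chromatography", "mass spectrometer"), which this sketch does not attempt.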

<sec id="Sec7" class="sec">
 <div class="title" xmlns="http://www.w3.org/1999/xhtml">GC-MS analysis</div>
 <p xmlns="http://www.w3.org/1999/xhtml">GC-MS chromatograms were recorded using Shimadzu QP-5000 GC-MS. The GC was equipped with Rtx-5 ms column (30 m long, 0.25 μm thickness and 0.250 mm inner diameter). Helium was used as a carrier gas at a flow rate of 1 ml/min. Injector temperature was 220 °C. Oven temperature was programmed from 50 °C (1 min hold) at 5 °C/min to 130 °C, then at 10 °C/min to 250 °C and kept isothermally for 15 min. Transfer line temperature was 290 °C. For GC-MS detection, an electron ionization system, with detector volts of 1.7 KV was used. A scan rate of 0.5 s, and scan speed 1000 amu/s was applied, covering a mass range from 38–450 M/Z.</p>
</sec>

extract "Shimadzu QP-5000 GC-MS"
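For XML input like the section above, one possible approach is to select sections whose title mentions GC-MS and then search the paragraph text. This is only a sketch: the regex, the function name `instruments_in_section`, and the abbreviated sample paragraph are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# The <div> and <p> children carry the XHTML namespace; <sec> itself does not.
XHTML = "{http://www.w3.org/1999/xhtml}"

# Hypothetical pattern: vendor name, a model token such as "QP-5000",
# and an optional "GC-MS" suffix.
SHIMADZU_RE = re.compile(r"Shimadzu\s+[A-Z]{2}-?\d{4}(?:\s+GC-MS)?")


def instruments_in_section(xml_text):
    """Return instrument mentions from a <sec> whose title mentions GC-MS."""
    sec = ET.fromstring(xml_text)
    title = sec.findtext(f"{XHTML}div", default="")
    if "GC-MS" not in title:
        return []
    hits = []
    for p in sec.iter(f"{XHTML}p"):
        hits.extend(SHIMADZU_RE.findall("".join(p.itertext())))
    return hits


# Abbreviated copy of the section quoted above.
sample = """<sec id="Sec7" class="sec">
 <div class="title" xmlns="http://www.w3.org/1999/xhtml">GC-MS analysis</div>
 <p xmlns="http://www.w3.org/1999/xhtml">GC-MS chromatograms were recorded using Shimadzu QP-5000 GC-MS. Helium was used as a carrier gas.</p>
</sec>"""
print(instruments_in_section(sample))
```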

ambarishK commented 4 years ago

OK sir.

ambarishK commented 4 years ago

Sir, please check instruments20191006.tsv.

Column description is as follows.

Total count of unique records - 95.

Please suggest changes before I make the dictionary.

petermr commented 4 years ago

Good start. There are some misspellings. Are these in the paper or did you mistype them? If they are in the paper, that's a good indicator of author errors.

On Sun, 6 Oct 2019, 14:06 Ambarish Kumar, notifications@github.com wrote:

Sir, please check instruments20191006.tsv: https://github.com/petermr/CEVOpen/blob/master/dictionary/instrument/instruments20191006.tsv

Column description is as follows.

- INSTRUMENTS: cleaned names of instruments used in GC-MS analysis.
- INSTRUMENTS_NORMALIZED: normalised list of instruments used in GC-MS analysis.

Total count of unique records - 95. Please check the sheet and suggest changes before I make the dictionary.


ambarishK commented 4 years ago

Sir, I extracted the text snippets exactly as they appear in the articles, so it is unlikely that I misspelled them. It is more likely that the authors wrote the names as they now appear on the sheet.

e.g.

At line 34, INSTRUMENTS column:

`Aligent 6890`
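Likely author misspellings such as this one could be flagged automatically with fuzzy matching against a reference list. A rough sketch using the standard-library `difflib`; the `KNOWN_VENDORS` list and the function name `suggest_vendor` are assumptions for illustration.

```python
import difflib

# Assumed reference list; only vendors already seen in this thread.
KNOWN_VENDORS = ["Agilent", "Shimadzu"]


def suggest_vendor(entry, cutoff=0.6):
    """Return the probable intended vendor if the first token looks misspelled.

    Returns None when the first token is already a known vendor or when
    nothing in the reference list is similar enough.
    """
    first = entry.split()[0]
    if first in KNOWN_VENDORS:
        return None
    matches = difflib.get_close_matches(first, KNOWN_VENDORS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(suggest_vendor("Aligent 6890"))
print(suggest_vendor("Agilent 6890"))
```

Any suggestion should still be checked against the paper, since the aim is to record author errors rather than silently correct them.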

A few are duplicates, e.g.

line 7

`Agilent 6890 (GC) and Agilent 5973 (MSD)`

and line 10

`Agilent 6890 (GC) and Agilent 5973 (MS)`

petermr commented 4 years ago

@ambarishK thanks. Instruments have qualifiers:

| INSTRUMENTS | INSTRUMENTS_NORMALIZED |
| --- | --- |
| Agilent 7890 (GC) | Agilent 7890 (GC) |
| Agilent 7890 (GC)  and Agilent 5975 (MSD) | Agilent 7890 (GC)  and Agilent 5975 (MSD) |
| Agilent 7890 (GC) and Agilent 5975 (MS) | |
| Agilent 7890 N (GC) and Triple Quad 7000 A model mass detector | Agilent 7890 N (GC) and Triple Quad 7000 A model mass detector |
| Agilent 7890A | |
| Agilent 7890A (GC) | Agilent 7890A (GC) |

We would need a GC-MS expert to tell us whether the letters (A, N) are significant. For the moment I suggest we use them as separate entries:

Agilent 7890
Agilent 7890 N
Agilent 7890N

Later we can use a regex to deal with whitespace differences.
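One way such a regex could look, as a sketch only. It assumes that a single capital letter after a 4-digit model number is a model suffix, so "7890 N" and "7890N" are the same model, while different letters (A, N) stay separate entries. The names `normalize` and `group_variants` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: one capital letter directly after a 4-digit model number is a
# suffix of that model, so the whitespace between them can be dropped.
SUFFIX_GAP_RE = re.compile(r"(\d{4})\s+([A-Z])\b")


def normalize(name):
    """Collapse runs of whitespace and join a detached one-letter suffix."""
    collapsed = re.sub(r"\s+", " ", name).strip()
    return SUFFIX_GAP_RE.sub(r"\1\2", collapsed)


def group_variants(names):
    """Map each normalized name to the raw spellings that produced it."""
    groups = defaultdict(list)
    for raw in names:
        groups[normalize(raw)].append(raw)
    return dict(groups)


entries = ["Agilent 7890", "Agilent 7890 N", "Agilent 7890N", "Agilent 7890A"]
for key, variants in group_variants(entries).items():
    print(key, variants)
```

Qualifiers such as "(GC)" are left untouched, because the letter must be followed by a word boundary and "(" is not matched by `[A-Z]`.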

I will create a first pass dictionary.

ambarishK commented 4 years ago

OK sir.