petermr / CEVOpen

Contentmining of Open phytochemical literature for medicinal activities

Elision of species names in composition table #65

Open deadlyvices opened 4 years ago

deadlyvices commented 4 years ago

We have a minor gremlin in the way that the composition tables are put together. The extraction routine seems to be eliding species names with preceding text:

Table 1Chemical composition, concentrations (%) and calculated retention indices, ofT. boveiessential oil as characterized by GC/MS analysis

This is now becoming an issue because I have made a start on processing oil compositions in KNIME by mining the composition*.html files. I'd like to tag the records with species, plant parts etc., and this is stopping me from doing it.

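When you look at the accompanying summary.html, the names aren't elided at all: [image: https://user-images.githubusercontent.com/10074162/70338096-312b8580-1844-11ea-8d4f-319d348e9128.png]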

petermr commented 4 years ago

Yes - I think it's general. It's a whitespace issue and I think we should replace \n with \s (i.e. a space). I'll have a spook around...
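For anyone following along, here is a minimal Python sketch of the whitespace fix being proposed. It is not the extraction code used in this repo: the sample string and the name normalize_whitespace are made up for illustration, and it assumes the species name ends up on its own line once the markup is dropped.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Replace every run of whitespace (newlines included) with one space."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical caption fragment with the species name on its own line.
raw = "retention indices, of\nT. bovei\nessential oil"

glued = raw.replace("\n", "")       # deleting newlines reproduces the bug
spaced = normalize_whitespace(raw)  # substituting a space keeps words apart

print(glued)
print(spaced)
```

The first print shows the run-together form seen in the composition tables; the second keeps "of", "T. bovei" and "essential oil" as separate words.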

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

petermr commented 4 years ago

Think I have fixed it. It arises in part from text like: this is<italic>E. coli</italic>a bacterium. I have added extra spaces in.

Please check https://github.com/petermr/CEVOpen/tree/master/searches/oil186

P.

deadlyvices commented 4 years ago

I've just done a git pull and I think it's still an issue. [image]

I was thinking that if perhaps you leave the tags in, I can get KNIME to strip them out instead.
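To make the "leave the tags in and strip them downstream" idea concrete, here is a rough sketch in Python rather than KNIME. The caption string is a made-up example assuming JATS-style italic tags are kept in the output, and strip_tags is a hypothetical helper, not anything that exists in this repo.

```python
import re

# Matches any single tag such as <italic> or </italic>.
TAG = re.compile(r"<[^>]+>")


def strip_tags(text: str) -> str:
    """Replace each tag with a space, then collapse repeated whitespace."""
    return re.sub(r"\s+", " ", TAG.sub(" ", text)).strip()


# Hypothetical caption fragment with the markup left in.
caption = "retention indices, of<italic>T. bovei</italic>essential oil"
print(strip_tags(caption))
```

Replacing each tag with a space, rather than with nothing, is what stops the species name from running into its neighbours. The trade-off is a stray space before punctuation that directly follows a closing tag, which would need a further clean-up pass.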

MikeWilliams-UK commented 4 years ago

@deadlyvices what folder are you pulling it to?

I have just done a git clone https://github.com/petermr/CEVOpen.git to C:\Temp\oils on my Azure VM (Server 2016) with no issues.

Remember Windows has path length restrictions of 256 characters in certain conditions.

deadlyvices commented 4 years ago

I'm just doing a pull on the repo. It's not the pull that's the problem. It's the data in the composition tables. The species names are still elided.

-- Clyde

MikeWilliams-UK commented 4 years ago

NP, I just thought a similar issue had popped up again. I did a "dir /s /b > abc.txt" to get just the file names and their paths. A quick and dirty program to scan them shows the max length is (222 - 13) = 209 once the 13 characters of C:\Temp\oils\ are subtracted; therefore the maximum length of the folder name where your repo is cloned to is (256 - 209) = 47.
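The arithmetic above can be reproduced with a short Python sketch. This is not the program that was actually run: longest_path and folder_headroom are hypothetical names, and the 256-character limit is taken from the earlier comment rather than from any Windows documentation.

```python
def longest_path(lines):
    """Length of the longest non-empty path in an iterable of text lines."""
    return max(
        (len(line.rstrip("\r\n")) for line in lines if line.strip()),
        default=0,
    )


def folder_headroom(longest, clone_root, limit=256):
    """Return (longest path relative to the clone root, room left for the root)."""
    relative = longest - len(clone_root)
    return relative, limit - relative


# Figures quoted in the comment: longest absolute path 222, cloned to C:\Temp\oils\
relative, headroom = folder_headroom(222, "C:\\Temp\\oils\\")
print(relative, headroom)
```

To run it against a real listing, longest_path can be handed the open abc.txt file. The encoding of a file produced by redirecting dir depends on the console code page, so it would have to be set by hand.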

I did find 743 lines (files) where the characters were NOT all 7-bit ASCII. Does anyone think this might be an issue? I can dump the list here if required.
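A check like this one can be sketched in a few lines of Python. Again, this is an illustration and not the program used above: non_ascii_lines is a hypothetical name and the two sample paths are invented. It works on raw bytes, so the result does not depend on guessing the encoding of the listing.

```python
def non_ascii_lines(raw_lines):
    """Yield (line number, line) for each line with a byte outside 7-bit ASCII."""
    for number, raw in enumerate(raw_lines, start=1):
        if any(byte > 0x7F for byte in raw):
            yield number, raw.rstrip(b"\r\n")


# Invented sample listing: the second path contains an accented character.
sample = [
    b"searches\\oil186\\plain_name.html\r\n",
    "searches\\oil186\\caf\u00e9.html\r\n".encode("utf-8"),
]

hits = list(non_ascii_lines(sample))
print(len(hits), [number for number, _ in hits])
```

For the real listing, open abc.txt in binary mode ("rb") and pass the file object straight to non_ascii_lines.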