gnames / gnfinder

GNfinder finds scientific names in UTF8 texts, PDF files, MS Word/Excel documents, URLs etc.
MIT License

best way to use gnfinder to find names in tabulated data and get results tabulated as in origin #120

Open abubelinha opened 2 years ago

abubelinha commented 2 years ago

Hello

I am planning to use gnfinder to process a column from a table with about 2500 rows.

ID LABEL
1 Blah blah blah _ScientificnameA blah blah _ScientificnameB blah blah
2 _ScientificnameC bleh blah blah _ScientificnameA
... ...
2500 Blah blah blih blah _ScientificnameX bluh blah blah _ScientificnameF blah blah

So, in fact, what I need to pass in to gnfinder is each cell of the second column, to extract names from it and return matches against some preferred name sources. But of course, I need to keep the returned info associated to each specimen ID (1st column in my table).

I was planning to use the API but I suppose I could try to use the CLI if it is more suitable to this purpose.

Thanks a lot

EDIT: not sure if this has relation to https://github.com/gnames/gnfinder/issues/56 but I am not using R dataframes. Just processing a CSV file in Python.

dimus commented 2 years ago

Hi @abubelinha, one way you can do it locally is to set up a pipe in Python to talk to the command-line gnfinder on your computer. It would be similar to https://github.com/gnames/gnparser#pipes

2500 separate calls to the API also do not sound too strenuous for the service.
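A minimal sketch of the pipe idea in Python, under stated assumptions: `find_names` is a hypothetical helper, and the command to run is passed in by the caller because the exact gnfinder invocation and flags are not given in this thread and should be checked against `gnfinder -h`. The helper only writes text to a process's stdin and returns its stdout.

```python
import subprocess


def find_names(text, cmd):
    """Send text to a command on stdin and return what it prints on stdout.

    cmd is a list such as ["gnfinder"]; that it reads stdin and which
    flags it needs are assumptions, not confirmed in this thread.
    """
    result = subprocess.run(
        cmd,
        input=text,           # written to the child's stdin
        capture_output=True,  # collect stdout and stderr
        text=True,            # work with str instead of bytes
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


# Hypothetical usage, assuming gnfinder is on PATH:
# output = find_names("Blah blah Parus major blah", ["gnfinder"])
```

Note that this starts one process per call, so calling it once per table row means 2500 process launches.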

abubelinha commented 2 years ago

Thanks @dimus. But I guess even using pipes, this would imply 2500 local gnfinder pipe calls, wouldn't it? (which again means 2500 online requests when verification is turned on, correct?) I would prefer to use one call, just in case I end up using this technique for something much bigger in the future.

Anyway, I had not realized that gnfinder returns the start/end position of each name found in the long text string. That could be very useful for my use case. Perhaps by creating a couple of new calculated columns in my table, label_length and cumulative_labels_length, and then concatenating all the labels' cells and passing them to gnfinder as a single long string ... I might be able to match the names found against the correct rows by comparing the returned start and end values of each name against these two columns' values.
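A rough sketch of that bookkeeping in Python. The helper names `row_starts` and `row_for_offset` are made up for illustration, and the sketch assumes the returned start values count characters the same way Python indexes a `str`; if they turn out to be byte offsets, the lengths would have to be computed on the encoded text instead.

```python
from bisect import bisect_right
from itertools import accumulate


def row_starts(labels):
    """Start offset of each label inside the plain concatenation "".join(labels)."""
    lengths = [len(label) for label in labels]
    # Cumulative lengths are the END offsets; shifting them by one gives the starts.
    return [0] + list(accumulate(lengths))[:-1]


def row_for_offset(starts, offset):
    """Index of the label that contains the given character offset."""
    return bisect_right(starts, offset) - 1


ids = [1, 2]
labels = ["Blah Aus bus blah", "Cus dus bleh"]
text = "".join(labels)
starts = row_starts(labels)

# Stand-in for a start position reported by the name finder.
found_start = text.index("Cus dus")
print(ids[row_for_offset(starts, found_start)])
```

The binary search keeps the lookup cheap even when the table grows well beyond 2500 rows.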

dimus commented 2 years ago

If you do not mind using the start/end positions, it should all work in one go. However, take #38 into account. If your file is tab-separated, everything will work; if it is comma-separated, you would probably need to preprocess the file and add a space after the commas.

abubelinha commented 2 years ago

Good point! As I am generating the original CSV I can control its format and make it tab-separated. Anyway, what I am passing to gnfinder is only one column (see LABEL column in table above), with all rows concatenated, like this (so no column separators affecting here):

"Blah blah blah Scientificname_A blah blah Scientificname_B blah blah|Scientificname_C bleh blah blah Scientificname_A|Blah blah blih blah Scientificname_X bluh blah blah Scientificname_F blah blah"

I use | symbols here to show you the limits between the original column cells (from top to bottom). But if I concatenate them, those symbols are not present in the text passed to gnfinder ... or should I use them? Which character would you use (if any) to separate the content of contiguous cells before feeding gnfinder?

I am trying to figure out what will happen if a taxon name is right at the end or beginning of a cell (if no separator is added, the two names will be concatenated).

Perhaps a space before and after separator would be better? (so 3 characters instead of just one)

dimus commented 2 years ago

Originally gnfinder was made to detect names in BHL, so it uses a space of any kind as a separator between words. The | characters should not affect anything, as long as there is a space after them.

dimus commented 2 years ago

Several spaces are OK.
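For illustration, a small Python sketch of joining cells with a space-padded separator while recording where each cell starts. `join_cells` is a hypothetical helper, the `" | "` separator is just the three-character option discussed above, and offsets are assumed to be character positions in the joined string.

```python
def join_cells(cells, sep=" | "):
    """Join cells with a separator and return (text, start offset of each cell)."""
    starts = []
    position = 0
    for cell in cells:
        starts.append(position)
        # The next cell begins after this cell plus one separator.
        position += len(cell) + len(sep)
    return sep.join(cells), starts


cells = ["x Aus bus", "Cus dus y"]
text, starts = join_cells(cells)

# The last word of one cell and the first word of the next stay separate tokens.
print(text.split())
# Each recorded offset points at the first character of its cell.
print([text[s:s + len(c)] for s, c in zip(starts, cells)])
```

Because the separator length is counted in the offsets, a name at the very end of one cell and a name at the very start of the next still fall in different ranges.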

dimus commented 2 years ago

CSV and TSV files should work fine, because they are going to be normalized to plain text with spaces.