first impressions and questions

abubelinha commented 2 years ago

First of all, thanks a lot for creating gndiff. It's gonna be so useful for me.

I have just tried with a small file, to get the feeling of how it works. I already found some issues to comment:

The description of the input file formats is unclear. It says "Prepare two files with names. There are 3 possible file formats:" ... but only two formats are actually mentioned: (1) a simple list, one name per line; (2) a CSV file with some other fields (see below). It is also unclear to me whether the CSV format applies only to the reference.csv file or also to source.csv:
- Names to be matched (source.csv), might also contain their own ids, but it is unclear to me whether gndiff is suggesting user to provide them or not (i.e., for adding them as a new column in output, so it is easier for me to rejoin that output against my original database).
  I think that is not the idea, because output already provides an autonumeric index. So I understand source.csv would usually contain just one field, with names and nothing else (just one column). With one possible exception:
- Except in case of using Family: I suppose in that case it should be present in both files. Correct? But I couldn't make it work properly in my tries (see below):
I might have misunderstood the input CSV format description above. But if Family and TaxonID are optional fields, then the JSON output sometimes contains errors:
1. If I don't provide a Family column in reference.csv, then json output referenceRecords[n].family contains the same value as name (the ScientificName field provided in my reference.csv file).
2. If I provide a Family column in reference.csv (even with empty values), then the json output seems correct (referenceRecords[n].family contains the family values I provided).
3. But if I also provide a Family in source.csv, then the json output includes a new sourceRecord.id which contains the same value as sourceRecord.name.
4. If source.csv contains other columns (e.g., ScientificName + LifeForm), then the json output produces sourceRecord.family=sourceRecord.id=sourceRecord.name (all containing the ScientificName provided in source.csv).

So I am a bit confused. I think it would be worth providing a couple of sample input files, and explicitly saying whether they can/should contain other columns or not.

Regarding family: A real example case of how "tricky homonyms where family helps to resolve taxa from each other" would be useful too (I think family is not going to solve anything in my case, but just to be sure). I wonder how this "use family" option affects speed: does it make matching faster or slower for large datasets?
CSV/TSV outputs are missing column headers? This could seem irrelevant, but it makes it a bit difficult to check whether the output content is correct. Also, I cannot proceed with further tasks, like merging this output with other tabular data by means of column joins (I can try to figure out the headers and add them myself ... but it would be safer if gndiff did it, to avoid mistakes).

EDIT: I have just realized that some of the above suggestions were already addressed by @Adafede in a previous closed issue (#12). Sorry about that. My comments are pretty verbose, so @dimus might still find some helpful feedback in some of them. This is a new one:

Shouldn't the output include some sort of calculated numeric similarity between the matched names? I have some cases where json produces several "Exact" matches (i.e. two referenceRecords for the same sourceRecord) because my reference.csv contains two similar versions of the same name (i.e. a subsp. rank vs. a var. rank, identical in everything else). But my source.csv only contains one (i.e. the subsp.). How can I decide which is the most similar in these cases? I will post an example in a new comment to illustrate this.

Thanks a lot in advance !!

abubelinha commented 2 years ago

To illustrate my results above a bit more, I will paste example 2 from my previous comment (I provided an empty family in reference.csv):

gndiff_source.csv : (A, B, C added here for reference: they were not part of the file)

ScientificName
A. Obione maritima (Alfredo) Pacino var. maritimaa
B. Obione maritima (Alfredo) Di'Stefano subsp. maritima
C. Quercus lamarkensis

gndiff_reference.csv :

TaxonID,Family,ScientificName
1001,,Obione maritima (Alfredo) Pacino var. maritima
1002,,Obione maritima (Alfredo) Pacino subsp. maritima

It is a minimal example, but I want to illustrate that my reference list wouldn't be a simple list of very different accepted names. Actually, it will include very similar names which may or may not be synonyms of other names in the same list (acceptedID is of course missing in the example).

TEST 1: JSON

*** COMMAND: C:\gnames\gndiff gndiff_source.csv gndiff_reference.csv -f pretty
*** OUTPUT:
 {
  "Matches": [
    {
      "sourceRecord": {
        "dataSet": "gndiff_source",
        "index": 1,
        "name": "Obione maritima (Alfredo) Pacino var. maritimaa"
      },
      "referenceRecords": [
        {
          "dataSet": "gndiff_reference",
          "index": 1,
          "editDistance": 1,
          "id": "1001",
          "name": "Obione maritima (Alfredo) Pacino var. maritima",
          "matchType": "Fuzzy"
        },
        {
          "dataSet": "gndiff_reference",
          "index": 2,
          "editDistance": 1,
          "id": "1002",
          "name": "Obione maritima (Alfredo) Pacino subsp. maritima",
          "matchType": "Fuzzy"
        },
        {
          "dataSet": "gndiff_reference",
          "index": 1,
          "editDistance": 1,
          "id": "1001",
          "name": "Obione maritima (Alfredo) Pacino var. maritima",
          "matchType": "Fuzzy"
        },
        {
          "dataSet": "gndiff_reference",
          "index": 2,
          "editDistance": 1,
          "id": "1002",
          "name": "Obione maritima (Alfredo) Pacino subsp. maritima",
          "matchType": "Fuzzy"
        }
      ]
    },
    {
      "sourceRecord": {
        "dataSet": "gndiff_source",
        "index": 2,
        "name": "Obione maritima (Alfredo) Di'Stefano subsp. maritima"
      },
      "referenceRecords": [
        {
          "dataSet": "gndiff_reference",
          "index": 1,
          "id": "1001",
          "name": "Obione maritima (Alfredo) Pacino var. maritima",
          "matchType": "Exact"
        },
        {
          "dataSet": "gndiff_reference",
          "index": 2,
          "id": "1002",
          "name": "Obione maritima (Alfredo) Pacino subsp. maritima",
          "matchType": "Exact"
        }
      ]
    },
    {
      "sourceRecord": {
        "dataSet": "gndiff_source",
        "index": 3,
        "name": "Quercus lamarkensis"
      },
      "referenceRecords": null
    }
  ]
}

TEST 2: CSV

*** COMMAND: C:\gnames\gndiff gndiff_source.csv gndiff_reference.csv -f csv
*** OUTPUT:
gndiff_source,1,,Obione maritima (Alfredo) Pacino var. maritimaa,gndiff_reference,Fuzzy,1,1001,Obione maritima (Alfredo) Pacino var. maritima,1
gndiff_source,1,,Obione maritima (Alfredo) Pacino var. maritimaa,gndiff_reference,Fuzzy,2,1002,Obione maritima (Alfredo) Pacino subsp. maritima,1
gndiff_source,1,,Obione maritima (Alfredo) Pacino var. maritimaa,gndiff_reference,Fuzzy,1,1001,Obione maritima (Alfredo) Pacino var. maritima,1
gndiff_source,1,,Obione maritima (Alfredo) Pacino var. maritimaa,gndiff_reference,Fuzzy,2,1002,Obione maritima (Alfredo) Pacino subsp. maritima,1
gndiff_source,2,,Obione maritima (Alfredo) Di'Stefano subsp. maritima,gndiff_reference,Exact,1,1001,Obione maritima (Alfredo) Pacino var. maritima,0
gndiff_source,2,,Obione maritima (Alfredo) Di'Stefano subsp. maritima,gndiff_reference,Exact,2,1002,Obione maritima (Alfredo) Pacino subsp. maritima,0
gndiff_source,3,,Quercus lamarkensis,,NoMatch,,,,

*** COMMAND: c:\gnames\gndiff -V
version: v0.1.1
build:   2021-12-28_02:44:39UTC

COMMENTS:

1. Why do I get up to 4 matches for name A (sourceRecord.index=1), when my reference.csv only has 2 rows? It looks like all Fuzzy matches are simply repeated twice. Why do all of them have editDistance=1? I would expect name 1001 (rank var.) to have a different value than 1002 (rank subsp.), because the searched name has varietal rank (var.).
2. Why do I get "Exact" matches for name B (sourceRecord.index=2, rank subsp.), if its authors are different to those of both reference.csv names? I would expect a Fuzzy match, where name 1002 (rank subsp.) should be a closer match than name 1001 (rank var.). Is it possible to have some numeric output information (similar to editDistance) which lets me decide to select name 1002 over 1001?
3. As I reported above, CSV output is a bit confusing without headers. For example, I can't figure out the meaning of the numeric values (0 | 1) in the rightmost column.
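
Until headers are added, one workaround is to attach column names on the reading side. A minimal Python sketch follows. The column names are hypothetical, guessed by lining up the CSV rows with the JSON output of the same run: the empty third column is assumed to be a source id, and the last column is assumed to mirror editDistance. None of these names come from gndiff itself.

```python
import csv
import io

# Hypothetical column names, inferred by comparing the CSV rows with the
# JSON output of the same run; gndiff v0.1.1 does not emit them.
ASSUMED_COLUMNS = [
    "source_dataset", "source_index", "source_id", "source_name",
    "reference_dataset", "match_type", "reference_index",
    "reference_id", "reference_name", "edit_distance",
]


def read_gndiff_csv(text):
    """Parse headerless gndiff CSV output into a list of dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(ASSUMED_COLUMNS):
            raise ValueError(
                f"expected {len(ASSUMED_COLUMNS)} fields, got {len(fields)}"
            )
        rows.append(dict(zip(ASSUMED_COLUMNS, (f.strip() for f in fields))))
    return rows


# Two rows copied from the CSV output above.
sample = (
    "gndiff_source,2,,Obione maritima (Alfredo) Di'Stefano subsp. maritima,"
    "gndiff_reference,Exact,2,1002,"
    "Obione maritima (Alfredo) Pacino subsp. maritima,0\n"
    "gndiff_source,3,,Quercus lamarkensis,,NoMatch,,,,\n"
)
rows = read_gndiff_csv(sample)
for row in rows:
    print(row["source_index"], row["match_type"], row["reference_id"])
```

With names attached, the rows can be joined back to the original data on `source_index`, again assuming that column is the 1-based row position shown as `index` in the JSON.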

I am mostly concerned about the possibility of getting the "best choice" in 1 & 2. But there are no numeric differences which could help in that task (especially when editDistance is not available). I only found info about editDistance in the source code. Is there a detailed explanation of this value somewhere?
I guess it is only calculated for fuzzy matches, but I couldn't figure out how different 2 names should be in order to get a different editDistance value (i.e. 0.8, 0.9) ... can I see some examples? I tried some variations but couldn't produce a value different to 1.
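
As a stopgap for choosing between several candidates with the same match type, the full name strings (rank and authorship included) can be compared on the client side. The sketch below uses Python's standard `difflib.SequenceMatcher`. This is a generic string similarity, not the score gndiff computes internally, and `pick_best` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def pick_best(source_name, reference_records):
    """Return the reference record whose full name is most similar.

    Similarity is difflib's ratio over the raw name strings, so rank
    markers and authorship both contribute. Returns None when there
    are no candidates (e.g. "referenceRecords": null).
    """
    if not reference_records:
        return None
    return max(
        reference_records,
        key=lambda rec: SequenceMatcher(None, source_name, rec["name"]).ratio(),
    )


# Name B and its two "Exact" candidates from the JSON output above.
source_name = "Obione maritima (Alfredo) Di'Stefano subsp. maritima"
candidates = [
    {"id": "1001", "name": "Obione maritima (Alfredo) Pacino var. maritima"},
    {"id": "1002", "name": "Obione maritima (Alfredo) Pacino subsp. maritima"},
]
best = pick_best(source_name, candidates)
print(best["id"])  # prints 1002
```

A plain string ratio treats every character equally, so it cannot weight rank over authorship; it only breaks ties that the current output leaves open.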

Biased by my limited experience with other gnames applications, I expected the gndiff JSON output to be much more verbose than it is (as an option, at least). As names are being parsed for the matching process, I would be interested in taking advantage of the parsed data in my output (for both source.csv and reference.csv names). I miss many fields which could help us make decisions when several matches are found:

- from gnparser: quality, canonical, authors, cardinality, or even details (-d) such as details.infraspecies.infraspecies.rank (would help to solve my example 2). Also warnings, like unparsed tails and so on.
- from gnverifier>0.6: I find really interesting those new and very informative scoreDetails showing numeric information about the matching process. They could also help me to make decisions, if I prefer to give more importance to one part of the name or another.

Of course, I mention all this because I assume all that info is actually being generated during gndiff matching process. So it wouldn't be difficult to (optionally) output everything, I hope.

So, to summarize FEATURE REQUESTS:

1. A "return bestmatch only" CLI option (so if my source.csv has 3 rows, gndiff should return 3 records only). Otherwise, return all matches.
2. More detailed output information, as a CLI option.
3. CSV needs headers.
4. As a bonus: let this program run as a server on port xxxx. This might solve some previous issues:
   - Use it as a small matching server for a particular reference list, so an independent web server app can pass gndiff a list of names and check it against a particular reference.csv of local interest. I guess this would be much simpler than other solutions (like a full gnames installation).
   - Use it from incompatible machines (in my case WinXP).

Sorry for the long explanations. I hope some of all this verbosity makes sense.

Thanks a lot for all your help, and sorry for not having tested gndiff much earlier (got some problems receiving github email notifications, so I missed many things during the last month)

dimus commented 2 years ago

@abubelinha, I finally got back to gndiff and am going through your comment.

I will try to modify the README according to your notes; answers to the other topics are below. Also see the newly opened issues. If something is missing, please create a new issue (one per topic); if something is there but needs to be refined, please add your feedback there.

Fields from source

Names to be matched (source.csv), might also contain their own ids, but it is unclear to me whether gndiff is suggesting user to provide them or not

It is definitely a possibility. I suspect a good approach would be to keep all fields from the source, and just enough fields from the reference (as it is now). I can make it as an option.

Source can have any number of fields, same as reference.

Field 'Family' is not implemented yet. If it is given in both files, it will be shown for both files.

Output bug

If I don't provide a Family column in reference.csv, then json output referenceRecords[n].family contains the same value as name (the ScientificName field provided in my reference.csv file).

This is a bug, I will make an issue.

Family field and the speed of execution.

No effect on speed at all. Families are just shown in the output for manual comparison by the user.

Calculated score

Yes, it does make sense. Currently the score is used only for sorting results; it could also be returned as a value, like in resolver.

Why do I get up to 4 matches for name A (sourceRecord.index=1), when my reference.csv only has 2 rows?

Sounds like a bug
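
Until the duplication is tracked down, repeated records can be dropped after parsing the JSON. A minimal sketch, assuming a reference record is uniquely identified by its `dataSet` and `index` fields as they appear in the output above; `dedupe_reference_records` is a hypothetical helper, not part of gndiff.

```python
def dedupe_reference_records(records):
    """Drop repeated reference records, keeping first occurrences in order.

    Assumes (dataSet, index) identifies a row of the reference file.
    Accepts None, since "referenceRecords" is null when nothing matched.
    """
    seen = set()
    unique = []
    for rec in records or []:
        key = (rec.get("dataSet"), rec.get("index"))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# The four Fuzzy records reported for name A, reduced to the key fields.
reported = [
    {"dataSet": "gndiff_reference", "index": 1, "id": "1001"},
    {"dataSet": "gndiff_reference", "index": 2, "id": "1002"},
    {"dataSet": "gndiff_reference", "index": 1, "id": "1001"},
    {"dataSet": "gndiff_reference", "index": 2, "id": "1002"},
]
unique = dedupe_reference_records(reported)
print([rec["id"] for rec in unique])  # prints ['1001', '1002']
```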

Why do I get "Exact" matches for name B (sourceRecord.index=2, rank subsp.), if its authors are different to those of both reference.csv names?

Exact match is made by canonical forms, and authorship is used to pick better results.

Why "subsp." gets "var." as the best result:

It is a bug, I have to find out why it happens.

More verbose output in JSON, parser data.

Makes sense, I'll make a ticket.

Edit distance is a Levenshtein edit distance between simple canonical forms.
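
To see why both reference names got editDistance=1 for name A, and why name B came out as "Exact", here is a small Python sketch of the Levenshtein distance. The canonical strings are written by hand under the assumption that the simple canonical form drops authorship and rank markers; they were not produced by running gnparser.

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


# Assumed simple canonical forms (no authorship, no "var."/"subsp." marker).
source_a = "Obione maritima maritimaa"   # name A, with the trailing typo
source_b = "Obione maritima maritima"    # name B
reference = "Obione maritima maritima"   # both 1001 and 1002 under this assumption

print(levenshtein(source_a, reference))  # prints 1
print(levenshtein(source_b, reference))  # prints 0
```

Under that assumption the two reference rows are indistinguishable at this stage, which is consistent with the identical editDistance values reported for 1001 and 1002.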

Run it as a server with resource list preloaded.

Interesting idea, yes, can be done too.

@abubelinha, thank you a lot for your feedback, a lot of good ideas and bug reports.

abubelinha commented 2 years ago

You're welcome. Thanks a lot to you indeed for all your awesome work
