Hi @dimus, this is more a question than an issue.

My initial idea for gndiff was to compare two files, so its .csv input design is perfect for that:
I call the executable from a Python script.
The script generates the reference.csv and source.csv files.
Then it calls the gndiff command-line executable, passing those filenames.
Finally, it captures the output and processes the content back in Python.
That just works.
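For context, a minimal sketch of that workflow looks roughly like this (the argument order and flags are from memory and may be wrong, so I always double-check `gndiff --help`):

```python
import csv
import subprocess

def gndiff_match(reference_rows, source_rows):
    """Write both CSVs, run gndiff, and return its raw output."""
    with open("reference.csv", "w", newline="") as f:
        csv.writer(f).writerows(reference_rows)
    with open("source.csv", "w", newline="") as f:
        csv.writer(f).writerows(source_rows)
    # Argument order is illustrative only; check `gndiff --help` for the real syntax.
    result = subprocess.run(
        ["gndiff", "source.csv", "reference.csv"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```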
Now I am wondering about some other possible use cases involving frequent, repetitive gndiff calls (also from Python).
My concern is whether so many disk writes and reads of .csv files could or should be avoided.
Imagine my script is parsing a long list of new specimens to be added to a museum collection.
I might prefer to gndiff-match them one by one, for whatever reason (my script might need to perform other intermediate tasks in a certain order before processing the next specimen name).
So I would be passing gndiff a small source.csv with just one row, but many times over.
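In Python, that repetitive pattern would look something like the sketch below (the CSV header and the gndiff argument order are just my assumptions, not the real input format):

```python
import csv
import subprocess

def match_one(name, reference_csv="reference.csv"):
    """Match a single specimen name: one file write plus one gndiff call."""
    with open("source.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "ScientificName"])  # header names are illustrative
        writer.writerow(["1", name])
    # Argument order is illustrative only; check `gndiff --help`.
    result = subprocess.run(
        ["gndiff", "source.csv", reference_csv],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

for specimen in ["Bufo bufo", "Parus major"]:
    # ... intermediate tasks for this specimen would happen here, in order ...
    output = match_one(specimen)
```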
In such a scenario, would it make sense not to create a source.csv on disk (which means a Python file write plus a gndiff file read), but to somehow pass the source info as a parameter instead?
Maybe this is already possible somehow, although I am not sure what syntax I should try.
Or maybe this doesn't make sense at all because the script's performance would be similar (i.e. the intermediate tasks take longer than the gndiff call anyway).
Of course, I can always design my script to process all gndiff-matching operations in advance.
I am just thinking before scripting, and I am not a professional, so don't take me too seriously.
Somewhat related to this: in #13 I suggested the possibility of using gndiff as a server (so we could run gndiff on one machine and call it from others).
If that feature ever becomes possible, I wonder how such a server would work.
I guess the basic idea would be to repeat exactly the same flow (gndiff receiving two files, doing the work, and returning the output as an HTTP response).
But another possible scenario is running it as a server with a predefined reference list: reference.csv is not passed in the HTTP requests but is defined at server start time, so the requests only contain a list of source taxa (or just one taxon) to match against that reference.csv. Again, the server could be receiving small but repetitive matching tasks.
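Just to make that idea concrete, a purely hypothetical client call might look like this (no such server exists; the host, endpoint, and payload shape are all invented for illustration):

```python
import requests

# Hypothetical: a gndiff server that loaded reference.csv at startup and
# only receives source names per request. Endpoint and JSON shape are made up.
resp = requests.post(
    "http://gndiff-host:8080/match",
    json={"names": ["Bufo bufo (Linnaeus, 1758)"]},
    timeout=10,
)
print(resp.json())
```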
Just wondering