globalbioticinteractions / elton

Access, review and index existing species interaction datasets
GNU General Public License v3.0

Using elton interactions from a DwC URL #47

Open zedomel opened 3 years ago

zedomel commented 3 years ago

Hi @jhpoelen

Following up on our discussion about indexing biotic interactions from GBIF, I have some questions which may require adding new features to elton. Let's see.

I'm using the following command to get all DwC-Archives from deeplinker.bio:

preston ls --remote https://deeplinker.bio/c253a5311a20c2fc082bf9bac87a1ec5eb6e4e51ff936e7be20c29c8e77dee55 --log tsv --no-cache | grep 'application/dwca' | cut -f1

It gives me a list of URLs of DwC-Archives:

http://tb.plazi.org/GgServer/dwca/1C546649D866E731FF8B2771487AD818.zip
http://tb.plazi.org/GgServer/dwca/E376FF8EFFF1F22C326D1E0DFF8BFFDF.zip
https://nzobisipt.niwa.co.nz/archive.do?r=westpac_chromis
https://nzobisipt.niwa.co.nz/archive.do?r=nearshorereeffishes
https://nzobisipt.niwa.co.nz/archive.do?r=mpi_tag

Now I'm trying to extract interaction data from these archives using elton:

preston ls --remote https://deeplinker.bio/c253a5311a20c2fc082bf9bac87a1ec5eb6e4e51ff936e7be20c29c8e77dee55 --log tsv --no-cache | grep 'application/dwca' | cut -f1 | elton interactions

c253a5311a20c2fc082bf9bac87a1ec5eb6e4e51ff936e7be20c29c8e77dee55 is the hash of the latest bio graph.

However, given the way elton works (as far as I know), I first need to run elton init with --data-url and --data-citation to create the globi.json file, and then set format: "dwca".

I'm wondering if there is some way to skip elton init and use elton interactions to extract all interactions directly from the DwC-As.
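Until such a feature exists, the manual step could be scripted. The following is only a sketch of that workaround: write_globi_configs is a hypothetical helper, the datasets/dataset-N layout is made up for illustration, and the "url" and "citation" keys are assumptions about the globi.json format (only format: "dwca" is taken from the description above). The URL is reused as a placeholder citation.

```shell
# Hypothetical helper (not part of elton): read DwC-A URLs from stdin and
# write one globi.json per URL under "$1"/dataset-N.
# ASSUMPTION: the "url" and "citation" keys; only "format": "dwca" comes
# from the discussion above.
write_globi_configs() {
  out_dir="$1"
  n=0
  while IFS= read -r url; do
    [ -n "$url" ] || continue          # skip empty lines
    n=$((n + 1))
    mkdir -p "$out_dir/dataset-$n"
    # The URL doubles as a placeholder citation.
    printf '{\n  "format": "dwca",\n  "citation": "%s",\n  "url": "%s"\n}\n' \
      "$url" "$url" > "$out_dir/dataset-$n/globi.json"
  done
  echo "$n"                            # number of configs written
}
```

It could be fed from the same pipeline, for example `preston ls ... | grep 'application/dwca' | cut -f1 | write_globi_configs datasets`. How elton would then be pointed at each generated directory is left open here.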

Maybe then I can do something like

preston ls --remote https://deeplinker.bio/c253a5311a20c2fc082bf9bac87a1ec5eb6e4e51ff936e7be20c29c8e77dee55 --log tsv --no-cache | grep 'application/dwca' | cut -f1 | elton interactions | awk -F'\t' '$27!=""'

The awk -F'\t' '$27!=""' filter is appended to keep only "complete" interaction records, since elton outputs records with an empty targetTaxonName (field number 27) when it can't find any interactions (for example, when the DwC-A does not contain any associatedTaxa data). Note that the \t must be quoted; otherwise the shell strips the backslash and awk splits on the letter "t".
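Because a hard-coded field number breaks if the column order ever changes, the same filter could look the column up by name. This is a minimal sketch, assuming the first line of the output is a tab-separated header row that contains targetTaxonName; filter_nonempty_column is a hypothetical helper, not an elton feature.

```shell
# Hypothetical helper: keep the header row plus every row whose column
# named "$1" is non-empty. Reads tab-separated text from stdin.
# ASSUMPTION: the first input line is a header row.
filter_nonempty_column() {
  awk -F'\t' -v col="$1" '
    NR == 1 {
      # Find the 1-based index of the requested column in the header.
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      print
      next
    }
    $idx != ""
  '
}
```

Appending `| filter_nonempty_column targetTaxonName` to the pipeline would then replace the `$27` filter.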

In parallel, I'm editing the scripts in https://github.com/bio-guoda/preston-scripts to store these DwC-As in an AWS EMR facility.

Additionally, for the kind of analysis that I'm trying to do, it would be interesting to know in which DwC fields the interactions are stored (associatedTaxa, associatedOccurrences, ResourceRelationship). Is there any way to get that information too?
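As a rough first approximation, the meta.xml of each archive can be scanned for the relevant Darwin Core term URIs. The sketch below only reports which of these terms an archive declares, not whether the corresponding columns actually hold data, and it is not something elton provides; list_interaction_terms is a hypothetical helper.

```shell
# Hypothetical helper: read a DwC-A meta.xml from stdin and print the
# interaction-related Darwin Core term URIs it declares, one per line.
list_interaction_terms() {
  grep -oE 'http://rs\.tdwg\.org/dwc/terms/(associatedTaxa|associatedOccurrences|ResourceRelationship)' \
    | sort -u
}
```

For a downloaded archive this could be run as `unzip -p archive.zip meta.xml | list_interaction_terms`, where archive.zip is a placeholder file name.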

thanks.

jhpoelen commented 3 years ago

@zedomel thank you for sharing your idea to improve elton by allowing it to extract interactions from DarwinCore archives without first having to make a globi.json configuration file via elton init ... (or manually).

I need a little time to think about how to implement this. Hoping to get back to this sooner rather than later.

I also much like the other (unrelated) activities you mentioned - importing https://deeplinker.bio into AWS EMR and querying the values from associatedTaxa, associatedOccurrences and resource relationship tables.

I've created separate issues for these in the preston repository:

https://github.com/bio-guoda/preston/issues/114 -> AWS EMR

https://github.com/bio-guoda/preston/issues/115 -> querying for specific values in associatedTaxa, associatedOccurrences, resource relationships

Do you mind continuing the conversation about these neat feature ideas / activities there?