millerse / US-National-Parasite-Collection

many unmatched names in parasite collection #3

Open jhpoelen opened 7 years ago

jhpoelen commented 7 years ago

Hi!

I noticed that only about 50% of all names occurring in the US national parasite collection match to external taxonomies. Could you poke (spot check) around and see whether the names used are in fact outdated or no longer in usage by (meta-)taxonomies?

I am curious to see whether the issue is due to bugs in GloBI or "bugs" in the dataset.

millerse commented 7 years ago

Yeah, it looks like a lot of the names are old, misspelled or badly written (ARDUENNA =ASCAROPS STRONGYLINA). How should I proceed with this?

jhpoelen commented 7 years ago

One way to go about it is to communicate the list of suspicious names (a subset of taxonUnmatched from http://globalbioticinteractions.org/references) to the US National Parasite Collection curators and ask them to look at it and, ideally, fix it on their end. Are the curators in your building?

jhpoelen commented 7 years ago

Perhaps @jhammock or @katjaschulz have some ideas.

jhammock commented 7 years ago

I'm inclined to present the unmatched names output to Anna Phillips and see if she has any ideas. That is how this data sharing thing is supposed to work in an academic utopia....

millerse commented 7 years ago

In the meantime, I can switch the data connector to a static one. I can also remove the problem names and keep that in a dataset that is stored online. Does that sound good?

jhpoelen commented 7 years ago

I kind of like the idea of keeping the suspicious names in the dataset - this allows for poking around the funny names and including them in the unmatched taxa report. I am curious to hear what Anna Phillips would need to help resolve the names to modern day taxonomy services. This is assuming she's willing to have a look and make changes in the first place ; ) .

millerse commented 7 years ago

@jhpoelen This may be a stupid question, but is there a way for me to keep the schema.json formula and have the url in the globi.json be linked to an archived copy of a cleaned table? I have been trying to do this for the past two hours with no success.

jhpoelen commented 7 years ago

You should be able to change the url. The url would have to point to a data file. A link to a google doc is probably not going to work, because it would just render a web page (html) instead of downloading a csv file.

Just curious - what do you mean by clean version? Did you correct the suspicious names or leave them out ?
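
On the point about the url needing to point at a data file rather than a web page: a rough way to tell the two apart is to look at the content type and the first few characters of whatever the url returns. The sketch below is only a heuristic, and the function name `looks_like_html` is made up for illustration; it is not part of GloBI.

```python
def looks_like_html(body: str, content_type: str = "") -> bool:
    """Heuristic check: True if a response looks like a rendered web page.

    `body` is the text returned by the url, `content_type` the value of the
    Content-Type header (if known). Both names are hypothetical helpers.
    """
    if "text/html" in content_type.lower():
        return True
    # Fall back to peeking at the start of the body.
    head = body.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


# A tab-separated data file versus a rendered page.
print(looks_like_html("name\thost\nA\tB\n", "text/tab-separated-values"))
print(looks_like_html("<!DOCTYPE html><html><body>doc</body></html>"))
```

If the second case is what comes back from your link, the url is probably serving a page and not the table itself.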

millerse commented 7 years ago

I moved the suspicious names to a second list.

jhpoelen commented 7 years ago

Cool! I would suggest to keep both in the globi.json, so that we keep it on the radar. Something like the example below.

Two unrelated comments - (1) try to avoid spaces in filenames - they tend to get confusing. Instead use _ or -. (2) when referring to a file within the repository, you can omit the http:// prefix. This also helps GloBI to use the specific version in the repository.

{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "rdfs:comment": ["inspired by https://www.w3.org/TR/2015/REC-tabular-data-model-20151217/"],
  "tables": [
    { "url": "cleaned_up.tsv",
      "dcterms:bibliographicCitation": "http://invertebrates.si.edu/parasites.htm",
      "tableSchema": "schema.json",
      "headerRowCount": 1,
      "interactionTypeId": "http://purl.obolibrary.org/obo/RO_0002444",
      "interactionTypeName": "parasiteof",
      "null": ["9999999998.0"]
    },
    { "url": "errors.tsv",
      "dcterms:bibliographicCitation": "http://invertebrates.si.edu/parasites.htm",
      "tableSchema": "schema.json",
      "headerRowCount": 1,
      "interactionTypeId": "http://purl.obolibrary.org/obo/RO_0002444",
      "interactionTypeName": "parasiteof",
      "null": ["9999999998.0"]
    }
  ]
}
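
The two filename comments above can be checked mechanically before committing. The following is a minimal sketch, assuming the config has a top-level "tables" list whose entries carry a "url" key, as in the example; `check_table_urls` is a hypothetical helper, not something GloBI provides.

```python
import json
from typing import List


def check_table_urls(config_text: str) -> List[str]:
    """Return warnings for the "url" entries under "tables" (hypothetical helper)."""
    config = json.loads(config_text)
    warnings = []
    for table in config.get("tables", []):
        url = table.get("url", "")
        if " " in url:
            warnings.append(f"{url!r}: contains spaces; consider _ or - instead")
        if url.startswith(("http://", "https://")):
            warnings.append(f"{url!r}: absolute url; a path within the repository may be enough")
    return warnings


# Illustrative input only; the second and third urls are made up.
sample = """
{"tables": [{"url": "cleaned_up.tsv"},
            {"url": "my errors.tsv"},
            {"url": "http://example.org/errors.tsv"}]}
"""
for warning in check_table_urls(sample):
    print(warning)
```

An empty result only means neither pattern was found; it says nothing about whether GloBI can read the files.
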
millerse commented 7 years ago

@jhpoelen Can you help me with the build continuing to fail?

I keep getting this message:

ERROR StudyImporterForGitHubData:63 - failed to import data from repo [millerse/US-National-Parasite-Collection]
org.eol.globi.data.StudyImporterException: read [4] columns, but found [43] column definitions.
    at org.eol.globi.data.StudyImporterForMetaTable.importAll(StudyImporterForMetaTable.java:267)
    at org.eol.globi.data.StudyImporterForMetaTable.importTable(StudyImporterForMetaTable.java:127)
    at org.eol.globi.data.StudyImporterForMetaTable.importStudy(StudyImporterForMetaTable.java:67)
    at org.eol.globi.data.StudyImporterForGitHubData.importData(StudyImporterForGitHubData.java:86)
    at org.eol.globi.data.StudyImporterForGitHubData.importData(StudyImporterForGitHubData.java:59)
    at org.eol.globi.tool.GitHubRepoCheck.main(GitHubRepoCheck.java:86)
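
The message compares the number of columns read from the data file with the number of column definitions in the schema, so one way to narrow it down locally is to count both. This is a rough sketch, assuming schema.json keeps its definitions in a top-level "columns" list (as CSVW table schemas do) and that the first row of the data file is the header; `compare_column_counts` is a hypothetical helper and does not reproduce GloBI's own parsing. Trying a different delimiter is just one guess at what could cause a mismatch.

```python
import csv
import io
import json


def compare_column_counts(data_text, schema_text, delimiter="\t"):
    """Return (columns in the header row, column definitions in the schema)."""
    header = next(csv.reader(io.StringIO(data_text), delimiter=delimiter))
    schema = json.loads(schema_text)
    # Assumption: definitions live under a top-level "columns" key.
    return len(header), len(schema.get("columns", []))


# Tiny made-up inputs, only to show the comparison.
data = "parasite\thost\tlocality\nA\tB\tC\n"
schema = '{"columns": [{"name": "parasite"}, {"name": "host"}, {"name": "locality"}]}'

print(compare_column_counts(data, schema))                 # tab-delimited read
print(compare_column_counts(data, schema, delimiter=","))  # same file read as comma-delimited
```

If the two numbers differ for the real files, the table and schema.json are out of step in some way; if they only differ for one delimiter, the file extension or delimiter setting may be worth a second look.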

jhpoelen commented 6 years ago

Hi @millerse - looks like the build is passing now. Could you please make a new release and submit it to the Zenodo GloBI community? This way, GloBI can use your recent work rather than the first May 2017 release.