jhpoelen / specimen-image-index

scripts to help index images of specimen in natural history collections around the world
Creative Commons Zero v1.0 Universal

some dwca with multimedia are present, but not detected #1

Closed jhpoelen closed 1 year ago

jhpoelen commented 1 year ago

As suspected by @timrobertson100, it appears that make.sh is leaving behind some dwca that do reference images.

using his comment in https://github.com/bio-guoda/preston/issues/168#issuecomment-1221098024 , I inspected the largest collection https://www.gbif.org/dataset/b5cdf794-8fa4-4a85-8b26-755d087bf531 or MNHN, Chagnoux S (2022). The vascular plants collection (P) at the Herbarium of the Muséum national d'Histoire Naturelle (MNHN - Paris). Version 69.272. MNHN - Museum national d'Histoire naturelle. Occurrence dataset https://doi.org/10.15468/nc6rxy accessed via GBIF.org on 2022-08-19.

Tracing the provenance of the source data:

preston cat --remote https://linker.bio hash://sha256/da7450941e7179c973a2fe1127718541bca6ccafe0e4e2bfb7f7ca9dbb7adb86 | grep 'http://collections.mnhn.fr/ipt/archive.do?r=mnhn-p>' | grep hasVersion | head -n1

produced:

<http://collections.mnhn.fr/ipt/archive.do?r=mnhn-p> <http://purl.org/pav/hasVersion> <hash://sha256/94bab7bcf80483c3fb97d2faa00b68e49ae9f3750694f9870d029d6d12157e9f> <urn:uuid:e0002340-72f5-4f00-a3a2-573a6f8870ee> .

indicating that the dataset was included.
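
To pull the content id out of a statement like the one above, a minimal sketch follows. It assumes whitespace-separated terms with the version hash in the third position, as in the line shown; `nquad_object` is a hypothetical helper name, not part of preston.

```shell
# Hypothetical helper (not part of preston): read N-Quads-style lines on stdin
# and print the third term (the object) of each line with its surrounding
# angle brackets removed. Assumes terms contain no spaces.
nquad_object() {
  awk '{ gsub(/^<|>$/, "", $3); print $3 }'
}

# Example with the statement shown above:
printf '%s\n' '<http://collections.mnhn.fr/ipt/archive.do?r=mnhn-p> <http://purl.org/pav/hasVersion> <hash://sha256/94bab7bcf80483c3fb97d2faa00b68e49ae9f3750694f9870d029d6d12157e9f> <urn:uuid:e0002340-72f5-4f00-a3a2-573a6f8870ee> .' | nquad_object
```

The same helper could be placed at the end of the `preston cat ... | grep hasVersion` pipeline to feed content ids into later steps.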

And, the content id appeared in one of the intermediate datasets, but just one.

$ grep "94bab7bcf80483c3fb97d2faa00b68e49ae9f3750694f9870d029d6d12157e9f" *_da74.tsv
content_da74.tsv:hash://sha256/94bab7bcf80483c3fb97d2faa00b68e49ae9f3750694f9870d029d6d12157e9f da74    2022-07-01

This suggested that the dataset was not recognized as a dataset with a multimedia extension.
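
The grep check above could be wrapped in a small function to see how far a content id made it through the intermediate files. This is only a sketch: `stages_reached` is a hypothetical name, and it assumes the intermediate files sit in the current directory and follow the `*_<short id>.tsv` naming seen above.

```shell
# Hypothetical helper: list the intermediate files in the current directory
# whose names end in _<short id>.tsv and that mention the given content id.
stages_reached() {
  content_id=$1
  short_id=$2
  # -l prints only the names of matching files; -F treats the id as a fixed string
  grep -l -F -- "$content_id" ./*_"$short_id".tsv
}

# Usage (run inside the directory holding the intermediate files):
# stages_reached "hash://sha256/94bab7bcf80483c3fb97d2faa00b68e49ae9f3750694f9870d029d6d12157e9f" da74
```

A content id that shows up only in `content_da74.tsv` was seen but dropped by the later filters.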

So, I suspect that https://github.com/jhpoelen/specimen-image-index/blob/main/has-multimedia.sh is not sufficient for detecting multimedia extensions.

Suggest to add more rules to widen the net to include meta.xml that look like:

preston cat --remote https://linker.bio 'zip:hash://sha256/94bab7bcf80483c3fb97d2faa00b68e49ae9f3750694f9870d029d6d12157e9f!/meta.xml' 
<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files>
      <location>occurrence.txt</location>
    </files>
    <id index="0" />
    <field index="1" term="http://purl.org/dc/terms/modified"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/institutionCode"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/collectionCode"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/catalogNumber"/>
    <field index="7" term="http://rs.tdwg.org/dwc/terms/occurrenceRemarks"/>
    <field index="8" term="http://rs.tdwg.org/dwc/terms/recordNumber"/>
    <field index="9" term="http://rs.tdwg.org/dwc/terms/recordedBy"/>
    <field index="10" term="http://rs.tdwg.org/dwc/terms/individualCount"/>
    <field index="11" term="http://rs.tdwg.org/dwc/terms/sex"/>
    <field index="12" term="http://rs.tdwg.org/dwc/terms/lifeStage"/>
    <field index="13" term="http://rs.tdwg.org/dwc/terms/reproductiveCondition"/>
    <field index="14" term="http://rs.tdwg.org/dwc/terms/behavior"/>
    <field index="15" term="http://rs.tdwg.org/dwc/terms/establishmentMeans"/>
    <field index="16" term="http://rs.tdwg.org/dwc/terms/occurrenceStatus"/>
    <field index="17" term="http://rs.tdwg.org/dwc/terms/preparations"/>
    <field index="18" term="http://rs.tdwg.org/dwc/terms/disposition"/>
    <field index="19" term="http://rs.tdwg.org/dwc/terms/otherCatalogNumbers"/>
    <field index="20" term="http://rs.tdwg.org/dwc/terms/previousIdentifications"/>
    <field index="21" term="http://rs.tdwg.org/dwc/terms/associatedMedia"/>
    <field index="22" term="http://rs.tdwg.org/dwc/terms/associatedReferences"/>
    <field index="23" term="http://rs.tdwg.org/dwc/terms/associatedOccurrences"/>
    <field index="24" term="http://rs.tdwg.org/dwc/terms/associatedSequences"/>
    <field index="25" term="http://rs.tdwg.org/dwc/terms/associatedTaxa"/>
    <field index="26" term="http://rs.tdwg.org/dwc/terms/eventID"/>
    <field index="27" term="http://rs.tdwg.org/dwc/terms/samplingProtocol"/>
    <field index="28" term="http://rs.tdwg.org/dwc/terms/samplingEffort"/>
    <field index="29" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
    <field index="30" term="http://rs.tdwg.org/dwc/terms/eventTime"/>
    <field index="31" term="http://rs.tdwg.org/dwc/terms/startDayOfYear"/>
    <field index="32" term="http://rs.tdwg.org/dwc/terms/endDayOfYear"/>
    <field index="33" term="http://rs.tdwg.org/dwc/terms/year"/>
    <field index="34" term="http://rs.tdwg.org/dwc/terms/month"/>
    <field index="35" term="http://rs.tdwg.org/dwc/terms/day"/>
    <field index="36" term="http://rs.tdwg.org/dwc/terms/verbatimEventDate"/>
    <field index="37" term="http://rs.tdwg.org/dwc/terms/habitat"/>
    <field index="38" term="http://rs.tdwg.org/dwc/terms/fieldNumber"/>
    <field index="39" term="http://rs.tdwg.org/dwc/terms/fieldNotes"/>
    <field index="40" term="http://rs.tdwg.org/dwc/terms/eventRemarks"/>
    <field index="41" term="http://rs.tdwg.org/dwc/terms/locationID"/>
    <field index="42" term="http://rs.tdwg.org/dwc/terms/higherGeographyID"/>
    <field index="43" term="http://rs.tdwg.org/dwc/terms/higherGeography"/>
    <field index="44" term="http://rs.tdwg.org/dwc/terms/continent"/>
    <field index="45" term="http://rs.tdwg.org/dwc/terms/waterBody"/>
    <field index="46" term="http://rs.tdwg.org/dwc/terms/islandGroup"/>
    <field index="47" term="http://rs.tdwg.org/dwc/terms/island"/>
    <field index="48" term="http://rs.tdwg.org/dwc/terms/country"/>
    <field index="49" term="http://rs.tdwg.org/dwc/terms/countryCode"/>
    <field index="50" term="http://rs.tdwg.org/dwc/terms/stateProvince"/>
    <field index="51" term="http://rs.tdwg.org/dwc/terms/county"/>
    <field index="52" term="http://rs.tdwg.org/dwc/terms/municipality"/>
    <field index="53" term="http://rs.tdwg.org/dwc/terms/locality"/>
    <field index="54" term="http://rs.tdwg.org/dwc/terms/verbatimLocality"/>
    <field index="55" term="http://rs.tdwg.org/dwc/terms/verbatimElevation"/>
    <field index="56" term="http://rs.tdwg.org/dwc/terms/minimumElevationInMeters"/>
    <field index="57" term="http://rs.tdwg.org/dwc/terms/maximumElevationInMeters"/>
    <field index="58" term="http://rs.tdwg.org/dwc/terms/verbatimDepth"/>
    <field index="59" term="http://rs.tdwg.org/dwc/terms/minimumDepthInMeters"/>
    <field index="60" term="http://rs.tdwg.org/dwc/terms/maximumDepthInMeters"/>
    <field index="61" term="http://rs.tdwg.org/dwc/terms/minimumDistanceAboveSurfaceInMeters"/>
    <field index="62" term="http://rs.tdwg.org/dwc/terms/maximumDistanceAboveSurfaceInMeters"/>
    <field index="63" term="http://rs.tdwg.org/dwc/terms/locationAccordingTo"/>
    <field index="64" term="http://rs.tdwg.org/dwc/terms/locationRemarks"/>
    <field index="65" term="http://rs.tdwg.org/dwc/terms/verbatimCoordinates"/>
    <field index="66" term="http://rs.tdwg.org/dwc/terms/verbatimLatitude"/>
    <field index="67" term="http://rs.tdwg.org/dwc/terms/verbatimLongitude"/>
    <field index="68" term="http://rs.tdwg.org/dwc/terms/verbatimCoordinateSystem"/>
    <field index="69" term="http://rs.tdwg.org/dwc/terms/verbatimSRS"/>
    <field index="70" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="71" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
    <field index="72" term="http://rs.tdwg.org/dwc/terms/geodeticDatum"/>
    <field index="73" term="http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters"/>
    <field index="74" term="http://rs.tdwg.org/dwc/terms/coordinatePrecision"/>
    <field index="75" term="http://rs.tdwg.org/dwc/terms/pointRadiusSpatialFit"/>
    <field index="76" term="http://rs.tdwg.org/dwc/terms/footprintWKT"/>
    <field index="77" term="http://rs.tdwg.org/dwc/terms/footprintSRS"/>
    <field index="78" term="http://rs.tdwg.org/dwc/terms/footprintSpatialFit"/>
    <field index="79" term="http://rs.tdwg.org/dwc/terms/georeferencedBy"/>
    <field index="80" term="http://rs.tdwg.org/dwc/terms/georeferenceProtocol"/>
    <field index="81" term="http://rs.tdwg.org/dwc/terms/georeferenceSources"/>
    <field index="82" term="http://rs.tdwg.org/dwc/terms/georeferenceRemarks"/>
    <field index="83" term="http://rs.tdwg.org/dwc/terms/identifiedBy"/>
    <field index="84" term="http://rs.tdwg.org/dwc/terms/typeStatus"/>
    <field index="85" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="86" term="http://rs.tdwg.org/dwc/terms/class"/>
    <field index="87" term="http://rs.tdwg.org/dwc/terms/order"/>
    <field index="88" term="http://rs.tdwg.org/dwc/terms/family"/>
    <field index="89" term="http://rs.tdwg.org/dwc/terms/genus"/>
    <field index="90" term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship"/>
  </core>
  <extension encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.gbif.org/terms/1.0/Multimedia">
    <files>
      <location>multimedia.txt</location>
    </files>
    <coreid index="0" />
    <field index="1" term="http://purl.org/dc/terms/type"/>
    <field index="2" term="http://purl.org/dc/terms/format"/>
    <field index="3" term="http://purl.org/dc/terms/identifier"/>
    <field index="4" term="http://purl.org/dc/terms/creator"/>
    <field index="5" term="http://purl.org/dc/terms/license"/>
  </extension>
  <extension encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.gbif.org/terms/1.0/Identifier">
    <files>
      <location>identifier.txt</location>
    </files>
    <coreid index="0" />
    <field index="1" term="http://purl.org/dc/terms/identifier"/>
    <field index="2" term="http://purl.org/dc/terms/title"/>
    <field index="3" term="http://purl.org/dc/terms/subject"/>
    <field index="4" term="http://purl.org/dc/terms/format"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/datasetID"/>
  </extension>
  <extension encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Identification">
    <files>
      <location>identification.txt</location>
    </files>
    <coreid index="0" />
    <field index="1" term="http://rs.tdwg.org/dwc/terms/identificationID"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/identifiedBy"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/dateIdentified"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/identificationReferences"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/identificationRemarks"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/identificationQualifier"/>
    <field index="7" term="http://rs.tdwg.org/dwc/terms/identificationVerificationStatus"/>
    <field index="8" term="http://rs.tdwg.org/dwc/terms/typeStatus"/>
    <field index="9" term="http://rs.tdwg.org/dwc/terms/taxonID"/>
    <field index="10" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="11" term="http://rs.tdwg.org/dwc/terms/namePublishedIn"/>
    <field index="12" term="http://rs.tdwg.org/dwc/terms/namePublishedInYear"/>
    <field index="13" term="http://rs.tdwg.org/dwc/terms/nameAccordingTo"/>
    <field index="14" term="http://rs.tdwg.org/dwc/terms/acceptedNameUsage"/>
    <field index="15" term="http://rs.tdwg.org/dwc/terms/parentNameUsage"/>
    <field index="16" term="http://rs.tdwg.org/dwc/terms/originalNameUsage"/>
    <field index="17" term="http://rs.tdwg.org/dwc/terms/higherClassification"/>
    <field index="18" term="http://rs.tdwg.org/dwc/terms/kingdom"/>
    <field index="19" term="http://rs.tdwg.org/dwc/terms/phylum"/>
    <field index="20" term="http://rs.tdwg.org/dwc/terms/class"/>
    <field index="21" term="http://rs.tdwg.org/dwc/terms/order"/>
    <field index="22" term="http://rs.tdwg.org/dwc/terms/family"/>
    <field index="23" term="http://rs.tdwg.org/dwc/terms/genus"/>
    <field index="24" term="http://rs.tdwg.org/dwc/terms/subgenus"/>
    <field index="25" term="http://rs.tdwg.org/dwc/terms/specificEpithet"/>
    <field index="26" term="http://rs.tdwg.org/dwc/terms/infraspecificEpithet"/>
    <field index="27" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
    <field index="28" term="http://rs.tdwg.org/dwc/terms/verbatimTaxonRank"/>
    <field index="29" term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship"/>
    <field index="30" term="http://rs.tdwg.org/dwc/terms/vernacularName"/>
    <field index="31" term="http://rs.tdwg.org/dwc/terms/nomenclaturalCode"/>
    <field index="32" term="http://rs.tdwg.org/dwc/terms/taxonomicStatus"/>
    <field index="33" term="http://rs.tdwg.org/dwc/terms/nomenclaturalStatus"/>
    <field index="34" term="http://rs.tdwg.org/dwc/terms/taxonRemarks"/>
  </extension>
</archive>
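
As an illustration of the kind of rule that would catch a meta.xml like the one above, here is a minimal sketch. It is not the actual has-multimedia.sh; `meta_mentions_media` is a hypothetical name, and it only looks for the two hints visible in this example: the GBIF Multimedia extension row type and the associatedMedia core field. Other media-related row types would need their own patterns.

```shell
# Hypothetical helper (not the actual has-multimedia.sh): succeeds when a
# meta.xml read from stdin declares the GBIF Multimedia extension or maps a
# column to dwc:associatedMedia.
meta_mentions_media() {
  grep -q -E 'rowType="http://rs\.gbif\.org/terms/1\.0/Multimedia"|term="http://rs\.tdwg\.org/dwc/terms/associatedMedia"'
}

# Usage, reusing the preston command shown above:
# preston cat --remote https://linker.bio 'zip:hash://sha256/94bab7bcf80483c3fb97d2faa00b68e49ae9f3750694f9870d029d6d12157e9f!/meta.xml' \
#   | meta_mentions_media && echo "references media"
```

Matching on meta.xml alone only shows that media columns are declared; whether the rows actually carry image links would still have to be checked in the data files.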

@timrobertson100 thanks for sharing your concerns, hoping to update the script and redo the work. Please let me know if you have more suggestions and/or comments related to this workflow.

jhpoelen commented 1 year ago

@timrobertson100 I've updated the script and posted results in https://github.com/bio-guoda/preston/issues/168#issuecomment-1222502227 . Please re-open / comment if you have more hints to suggest that further improvements are needed.

jhpoelen commented 1 year ago

Note that with the results from updated scripts, the earlier referenced dataset

https://www.gbif.org/dataset/b5cdf794-8fa4-4a85-8b26-755d087bf531 or MNHN, Chagnoux S (2022). The vascular plants collection (P) at the Herbarium of the Muséum national d'Histoire Naturelle (MNHN - Paris). Version 69.272. MNHN - Museum national d'Histoire naturelle. Occurrence dataset https://doi.org/10.15468/nc6rxy accessed via GBIF.org on 2022-08-19.

is now included as desired:

$ grep "94bab7bcf80483c3fb97d2faa00b68e49ae9f3750694f9870d029d6d12157e9f" *_da74.tsv
content_da74.tsv:hash://sha256/94bab7bcf80483c3fb97d2faa00b68e49ae9f3750694f9870d029d6d12157e9f da74    2022-07-01
content-with-multimedia_da74.tsv:hash://sha256/94bab7bcf80483c3fb97d2faa00b68e49ae9f3750694f9870d029d6d12157e9f da74    2022-07-01
content-with-still-images-and-specimen_da74.tsv:hash://sha256/94bab7bcf80483c3fb97d2faa00b68e49ae9f3750694f9870d029d6d12157e9f  da74    2022-07-01
content-with-still-images_da74.tsv:hash://sha256/94bab7bcf80483c3fb97d2faa00b68e49ae9f3750694f9870d029d6d12157e9f   da74    2022-07-01

timrobertson100 commented 1 year ago

Thanks @jhpoelen

I don't have time to dig into this, but I'll leave this here in case one of us or someone else does in the future.

A good exploration might be to look at the datasets for which preston detects images (e.g., plant specimens) and compare that with the datasets GBIF.org reports as having images, using this API call for the top 25 by record count.

For any dataset that is not picked up in preston, explore how the data are published, and find out why.

jhpoelen commented 1 year ago

Only one of the datasets listed by the GBIF API call for the top 25 datasets (by record count) having images is missing from the Preston provenance log, namely

https://www.gbif.org/dataset/f873ef66-231a-4ea3-bd4d-a16f182bf337 Academy of Natural Sciences (2022). Academy of Natural Sciences of Drexel University. Occurrence dataset https://doi.org/10.15468/s55f2k accessed via GBIF.org on 2022-08-23.

I found this via:

  1. getting the unique, sorted list of GBIF dataset ids from their API on 2022-08-23 (note that I don't know which specific versions of GBIF indexed datasets are used to generate this result)
    curl "https://api.gbif.org/v1/occurrence/search?basis_of_record=PRESERVED_SPECIMEN&media_type=StillImage&taxon_key=6&advanced=1&occurrence_status=present&facet=datasetKey&limit=0&facetLimit=25" \
    | jq --raw-output '.facets[].counts[].name' \
    | sort \
    | uniq \
    > datasetIds.tsv

datasetIds.tsv contained:

b5cdf794-8fa4-4a85-8b26-755d087bf531
15f819bd-6612-4447-854b-14d12ee1022d
821cc27a-e3bb-4bc5-ac34-89ada245069d
d415c253-4d61-4459-9d25-4015b9084fb0
b740eaa0-0679-41dc-acb7-990d562dfa37
861e6afe-f762-11e1-a439-00145eb45e9a
902c8fe7-8f38-45b0-854e-c324fed36303
cd6e21c8-9e8a-493a-8a76-fbf7862069e5
cb9beff3-a185-486f-975a-732251444158
e45c7d91-81c6-4455-86e3-2965a5739b1f
40a5e8ac-3d50-4e03-849c-d52defe3ff6b
c8d12f8a-7e39-4e2c-92d7-825d590ad15b
202cdfcf-0eac-4696-9e48-4797c562ff41
90c853e6-56bd-480b-8e8f-6285c3f8d42b
f873ef66-231a-4ea3-bd4d-a16f182bf337
7bd65a7a-f762-11e1-a439-00145eb45e9a
27b4ff4b-29c3-4017-9c48-3750861392f7
7e380070-f762-11e1-a439-00145eb45e9a
bf2a4bf0-5f31-11de-b67e-b8a03c50a862
834c9918-f762-11e1-a439-00145eb45e9a
56e9c560-bd2a-11dd-b15e-b8a03c50a862
064508e2-255e-4d82-9f13-05d73476cc03
648e3756-5dff-462b-9323-41e3214d9c3c
cd096ec0-a8a0-4c65-92cc-48b00fb934ce
e42c4be9-dc6d-466b-8df2-60ac8c47fadd

then, find dataset ids in provenance log with short id da74 and compare them with the list provided by GBIF API search results:

diff <(preston cat --remote https://linker.bio hash://sha256/da7450941e7179c973a2fe1127718541bca6ccafe0e4e2bfb7f7ca9dbb7adb86 | grep -o -f datasetIds.tsv | sort -u) <(sort -u datasetIds.tsv)

yielding:

> f873ef66-231a-4ea3-bd4d-a16f182bf337

which is the Drexel dataset (see attached screenshot). Note that its publication date is July 1st, 2022, which suggests that the Preston snapshot may have missed it simply because the dataset had not yet been added to the GBIF registry.
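
The comparison above is a set difference between two id lists. A minimal sketch of the same idea using `comm` instead of `diff` follows; `missing_ids` and the two file arguments are hypothetical names, and each file is assumed to hold one identifier per line.

```shell
# Hypothetical helper: print the lines of the "expected" file ($2) that do not
# occur in the "seen" file ($1). Both files hold one identifier per line.
missing_ids() {
  seen_sorted=$(mktemp) || return 1
  expected_sorted=$(mktemp) || { rm -f "$seen_sorted"; return 1; }
  sort -u "$1" > "$seen_sorted"
  sort -u "$2" > "$expected_sorted"
  # comm -13 suppresses lines unique to the first file and lines common to
  # both, leaving only the lines unique to the second (expected) file.
  comm -13 "$seen_sorted" "$expected_sorted"
  rm -f "$seen_sorted" "$expected_sorted"
}

# Usage: missing_ids idsSeenInProvenanceLog.tsv datasetIds.tsv
```

Unlike `diff`, `comm -13` prints only the missing ids, without the `>` markers, which makes the result easier to feed into later steps.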

Screenshot from 2022-08-23 08-01-45

jhpoelen commented 1 year ago

Now, query associated dataset URLs as recorded in provenance log da74:

preston cat --remote https://linker.bio hash://sha256/da7450941e7179c973a2fe1127718541bca6ccafe0e4e2bfb7f7ca9dbb7adb86\
 | grep -f datasetIds.tsv\
 | grep hadMember\
 | grep -P "^<[0-9a-f-]+>"\
 | cut -d ' ' -f3\
 | tr -d '<>'\
 | sort\
 | uniq\
 > datasetURLs.tsv

yields:

http://biocase.kew.org/downloads/kdb01/Royal%20Botanic%20Gardens,%20Kew%20-%20Herbarium%20Specimens.DwCA.zip
http://collections.mnhn.fr/ipt/archive.do?r=mnhn-p
http://collections.mnhn.fr/ipt/archive.do?r=mpu
http://collections.mnhn.fr/ipt/eml.do?r=mnhn-p
http://collections.mnhn.fr/ipt/eml.do?r=mpu
http://data.nhm.ac.uk/resources/gbif_dwca.zip
http://data.rbge.org.uk/service/dwca/data/darwin_core.zip
http://ipt.mobot.org:8080/ipt/archive.do?r=tropicosspecimens
http://ipt.mobot.org:8080/ipt/eml.do?r=tropicosspecimens
http://ipt.tacc.utexas.edu/archive.do?r=prc
http://ipt.tacc.utexas.edu/eml.do?r=prc
https://api.biodiversitydata.nl/v2/specimen/dwca/getDataSet/botany
https://apm-ipt.br.fgov.be:8443/ipt/archive.do?r=botanical_collection
https://apm-ipt.br.fgov.be:8443/ipt/eml.do?r=botanical_collection
https://collections.nmnh.si.edu/ipt/archive.do?r=nmnh_extant_dwc-a
https://collections.nmnh.si.edu/ipt/eml.do?r=nmnh_extant_dwc-a
https://depo.msu.ru/ipt/archive.do?r=plants
https://depo.msu.ru/ipt/eml.do?r=plants
https://fmipt.fieldmuseum.org/ipt/archive.do?r=fmnh_seedplants
https://fmipt.fieldmuseum.org/ipt/eml.do?r=fmnh_seedplants
https://gbif.laji.fi/archives/HR.168.zip
https://gbif.laji.fi/eml/HR.168.zip
https://ipt.gbif.es/archive.do?r=ma-fanero
https://ipt.gbif.es/eml.do?r=ma-fanero
https://ipt.gbif.no/archive.do?r=o_vascular
https://ipt.gbif.no/eml.do?r=o_vascular
https://ipt.huh.harvard.edu/ipt/archive.do?r=huh_all_records
https://ipt.huh.harvard.edu/ipt/eml.do?r=huh_all_records
https://ipt.lsa.umich.edu/archive.do?r=umherb
https://ipt.lsa.umich.edu/eml.do?r=umherb
https://ipt.recolnat.org/archive.do?r=lyb
https://ipt.recolnat.org/eml.do?r=lyb
https://nansh.org/portal/content/dwca/VT_DwC-A.zip
https://sernecportal.org/portal/content/dwca/GA_DwC-A.zip
https://sernecportal.org/portal/content/dwca/LSU-VascularPlants_DwC-A.zip
https://sernecportal.org/portal/content/dwca/NCU-VascularPlants_DwC-A.zip
http://sweetgum.nybg.org:8080/ipt/archive.do?r=occurrences
http://sweetgum.nybg.org:8080/ipt/eml.do?r=occurrences
https://www.herbarien.uzh.ch/ipt/archive.do?r=herbaria-z-zt
https://www.herbarien.uzh.ch/ipt/eml.do?r=herbaria-z-zt

Note that these include URLs pointing to EML files.
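
To keep only the archive URLs, the EML entries can be filtered out. The sketch below assumes the two EML URL shapes visible in the list above (`/eml.do?` and `/eml/`); `archive_urls_only` is a hypothetical name.

```shell
# Hypothetical helper: read URLs on stdin and drop those that point to EML
# metadata, based on the two URL shapes seen in the list above.
archive_urls_only() {
  grep -v -E '/eml(\.do\?|/)'
}

# Usage: archive_urls_only < datasetURLs.tsv
```

This is a little stricter than a plain `grep -v "eml"`, which would also drop any archive URL that happens to contain those three letters.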

jhpoelen commented 1 year ago

And the associated content ids are:

preston cat --remote https://linker.bio hash://sha256/da7450941e7179c973a2fe1127718541bca6ccafe0e4e2bfb7f7ca9dbb7adb86 | grep -f datasetURLs.tsv | grep hasVersion | grep -o -P "hash://sha256/[0-9a-f]+" | sort | uniq

yielding

hash://sha256/0c3033e46d4df9ae1a6394155c6d6ce259d2f06d311c548ed1a84e836ecb1983
hash://sha256/0c38a784574107da56a763faf9765fb08774f2af10ffbebb8cd9957e2ec04953
hash://sha256/1084b266be1d79983b311bece3d3a340c725d63f60c90818fbe2340502e1f276
hash://sha256/10d25708bb536ed27ff312929890bd672203b6c40d78b9638c4fab89c5fb8d89
hash://sha256/175d13c039388064a447abfeb90cf87268bce2f9a41e603eb4924e3dfabbd795
hash://sha256/179af2a4a0dea96c5a669415e21189dbb529215da58dd9c174d3dacb1cdbc362
hash://sha256/18fbbc51b507f7261ae77fafdd241d6a32faa6ae9f7f805bf231c17ea75798b3
hash://sha256/25371a107e48d30093da79cdd7897dca1df1552bcfe2965087fdb4e2ea1f3447
hash://sha256/288f711c26328387a1e9dd704ea1768780a8bd59b68d583083d229d8152a2aad
hash://sha256/2ed655300fe838b34e23527942015ed800052a8eb14ae50e57344e1d0ed5062b
hash://sha256/32c1ca4adf75843cf70269a92943b153587e27074dd48c59a6a028774da2b4ba
hash://sha256/33f47630a3e8b563f21031b75d93ecabdbcb61fe5ae45868e5d718e9c91dda43
hash://sha256/3704b79b1e43d77e89ca9aa87bd9350be5d296887a64e5604bdee4979192414c
hash://sha256/3a394684ab4b516569e8483d43b57131d14171083cb2979fd08c6455cb31cb9d
hash://sha256/42eefd25c4a5d065da50feb442496a02fdd460bfcd85ee7ecdba49e3cdcb58e7
hash://sha256/496955550260b198a75b471e31ed473d8845120a1b4331023e79e2219c32976f
hash://sha256/4ad840484e75ccc6c90ad13ed2bf354a9a2023d2309bcdb4ca80bb101da65d9f
hash://sha256/56b5d163e3cfe2041403bb8cb0a70b603db54fa52f1ce885593e9f27e7e1329b
hash://sha256/59da373a940136e9a327f7cba3736025cf67cbaf5e5cdf7937871d2fb55d3f6f
hash://sha256/5a1ba532aa56f6caeed0aebb28de45b1bbf2a5c32069f47505c705b4596cc835
hash://sha256/5e0726ae35d4af218c5cd9ff47b2d738b2c4d73f9d04cde84125f105d8d7fe97
hash://sha256/6395a8408a82cc7f6b50953f914790d3ae2b7f14fca6482782da8792073031d8
hash://sha256/6574583b58d31dee20660cb577a0d84b064db70afa0c8b44764435b4580ba209
hash://sha256/67b960e2ee85e86c5ba4455bb249d1992a4d73ec1ce6efd25ac3803ce1cb46d3
hash://sha256/6cff4c9ae21b6d94bb21a91b135f37368904f2aea4e005fc1974e9697df54b22
hash://sha256/6fe3ec927fa613ea7e0eff85ec93a2627a0fcea24ba910f1c6a3f8fc45eb756d
hash://sha256/78493bb2e75b213d48249a3dfb1343b14664c996ff0a2416eac73afdef0f6110
hash://sha256/7884b92bdc4f7f149850f7f43446ed9db4174e77d22b0e5298e51cb64e261aee
hash://sha256/7bf7ab6d0dde14ae492511ae53bc72eb4e14b6c98ef23f297edf9d7cd83c3557
hash://sha256/85704cb13b9348b9e5de3066ff7edaa88cae757627a49255d53e1e31d701425a
hash://sha256/88fcbdf471a0ff1c922377d0c4c288882dfa0e46ab83ac53a7e5900606ce6e01
hash://sha256/919f9670253636aa979d9a447eb4740ac512899f655c01308954dcf122e95e09
hash://sha256/94bab7bcf80483c3fb97d2faa00b68e49ae9f3750694f9870d029d6d12157e9f
hash://sha256/94e5ffb64a68bda0cdbd22616415f19839d5ea9410fa6b8ea90189373cd0daa0
hash://sha256/9a28612706a42861affdb2f2354e7f7fe9cf2cc0a47cae55c94f65b0448cf5cf
hash://sha256/a1a5706f0810051f9135148b139a7e5022931cb8fcb7f9e57f1881eff127a74c
hash://sha256/a44d056496bd447d2356d1e4a3bc48035c37f4a40822dde2a31ab9f9a57468f7
hash://sha256/a4685b7246eac48fdba48789ebd414fcc4559df8034df964cadbf9060b188fad
hash://sha256/a4862c68e4de13331ae5733065cd78429f8bb373b76da4fc5348b2b2913150df
hash://sha256/a53e28d147824947a2ffbc4b501caf372c9f17c3d5d92b3e8a4c3476b2a3b8ca
hash://sha256/a642c2230cb84d90b6c2c93981ca03774acadbcddc207563c7e49a2fd607d0f9
hash://sha256/af6f7f92ab03242214cf6b0ff8915fbdf43669ac09e846a0bef84b47f98f96d5
hash://sha256/b29010327ec2bdbc73b52d3d2fbf2c7682a157026976a899bb8fc26300ba7f76
hash://sha256/b775034faee1f9ca64f022de1119bb12999eb40cc195b5371fc49cee478f2dd9
hash://sha256/bb2f55ec2c12472aece8e3cdf56471d6a0356d5d47929bf603ed94edf4104f34
hash://sha256/c40b79578e9cfae5ae87fbda3437c3b24785241bafa936cae9df94dffa4f8261
hash://sha256/cf95d635bc714477fc0336004e5621e10db4b249def77294312db58866a86192
hash://sha256/d154aa55e318cf9db67110008232aa3e362d3585ef1ac267e2ad2c0cd85dddc3
hash://sha256/d2f52720e8035824a1c86bca72fd64cfdb1d3e5956d92eac16ec725eb3c62ca1
hash://sha256/d355f8d5fc32580a4929e9ed010c7160ae08ec0e92f7f9e62ded7ac607b90e85
hash://sha256/d6d0b6af7c547cfeb110b7d16a2aadcc2cea84ac474dda71ef2816c2925e623a
hash://sha256/df5a6d1a1281cf20adcbffde3f9de0712c1bfc943373f8c9151e3bc54e3937e9
hash://sha256/e094905c25b4e132095bf45169f2d3b917f5558322507fd4b2f72f6b067d9dd6
hash://sha256/e2d5f45998f302cbff122e48eb300537568dba4edf3aef06aba0ed6e9196f634
hash://sha256/e3db92bce8ff1738d6bdbfbf051526675f68f5ba116bcac63087a6a30bd74da9
hash://sha256/e8028a2f132734ff6d649a45f746b92fd8a958659cac120f0c66cf386a6b6b05
hash://sha256/e84da586a58cd9c84d0dffb89521b7f2e4c7217c9c933cc50f3bf106c3dc5e32
hash://sha256/ee2a2189f5cfc369c61a6dab84cc78c3412619a5744de937bb25438808b2ec90
hash://sha256/f1051f2767f556d86e71e7aa3bb018cbae384f12c96926a9fce28186225c2181
hash://sha256/f798f39aa4b47d1db285630226056938dfbbe248931c053a8739efec6b3f065c
hash://sha256/fe14cb7c59d299b82d1a13d7f1276cfea7e25514f3d3f0a0b46f08f952079967

jhpoelen commented 1 year ago

Of these, the following 17 content ids were detected to have StillImage and PreservedSpecimen records:

hash://sha256/1084b266be1d79983b311bece3d3a340c725d63f60c90818fbe2340502e1f276
hash://sha256/179af2a4a0dea96c5a669415e21189dbb529215da58dd9c174d3dacb1cdbc362
hash://sha256/25371a107e48d30093da79cdd7897dca1df1552bcfe2965087fdb4e2ea1f3447
hash://sha256/33f47630a3e8b563f21031b75d93ecabdbcb61fe5ae45868e5d718e9c91dda43
hash://sha256/496955550260b198a75b471e31ed473d8845120a1b4331023e79e2219c32976f
hash://sha256/5a1ba532aa56f6caeed0aebb28de45b1bbf2a5c32069f47505c705b4596cc835
hash://sha256/5e0726ae35d4af218c5cd9ff47b2d738b2c4d73f9d04cde84125f105d8d7fe97
hash://sha256/6cff4c9ae21b6d94bb21a91b135f37368904f2aea4e005fc1974e9697df54b22
hash://sha256/78493bb2e75b213d48249a3dfb1343b14664c996ff0a2416eac73afdef0f6110
hash://sha256/85704cb13b9348b9e5de3066ff7edaa88cae757627a49255d53e1e31d701425a
hash://sha256/88fcbdf471a0ff1c922377d0c4c288882dfa0e46ab83ac53a7e5900606ce6e01
hash://sha256/94bab7bcf80483c3fb97d2faa00b68e49ae9f3750694f9870d029d6d12157e9f
hash://sha256/9a28612706a42861affdb2f2354e7f7fe9cf2cc0a47cae55c94f65b0448cf5cf
hash://sha256/a44d056496bd447d2356d1e4a3bc48035c37f4a40822dde2a31ab9f9a57468f7
hash://sha256/b775034faee1f9ca64f022de1119bb12999eb40cc195b5371fc49cee478f2dd9
hash://sha256/e3db92bce8ff1738d6bdbfbf051526675f68f5ba116bcac63087a6a30bd74da9
hash://sha256/f1051f2767f556d86e71e7aa3bb018cbae384f12c96926a9fce28186225c2181

This leaves 25 - 1 (the recently registered Drexel dataset) - 17 = 7 datasets unaccounted for compared with the GBIF top 25.

Here's a sample of hashes to be investigated for structure:

preston cat --remote https://linker.bio hash://sha256/da7450941e7179c973a2fe1127718541bca6ccafe0e4e2bfb7f7ca9dbb7adb86 | grep -f datasetURLs.tsv | grep hasVersion | grep -v -E -f datasetsTop25WithStillImageAndPreservedSpecimen.tsv  | grep -v "eml" | cut -d ' ' -f3 | sort | uniq -c | sort -nr
      3 <hash://sha256/bb2f55ec2c12472aece8e3cdf56471d6a0356d5d47929bf603ed94edf4104f34>
      3 <hash://sha256/94e5ffb64a68bda0cdbd22616415f19839d5ea9410fa6b8ea90189373cd0daa0>
      2 <hash://sha256/e8028a2f132734ff6d649a45f746b92fd8a958659cac120f0c66cf386a6b6b05>
      2 <hash://sha256/b29010327ec2bdbc73b52d3d2fbf2c7682a157026976a899bb8fc26300ba7f76>
      2 <hash://sha256/a53e28d147824947a2ffbc4b501caf372c9f17c3d5d92b3e8a4c3476b2a3b8ca>
      2 <hash://sha256/67b960e2ee85e86c5ba4455bb249d1992a4d73ec1ce6efd25ac3803ce1cb46d3>
      2 <hash://sha256/59da373a940136e9a327f7cba3736025cf67cbaf5e5cdf7937871d2fb55d3f6f>
      2 <hash://sha256/3a394684ab4b516569e8483d43b57131d14171083cb2979fd08c6455cb31cb9d>
      1 <hash://sha256/ee2a2189f5cfc369c61a6dab84cc78c3412619a5744de937bb25438808b2ec90>
      1 <hash://sha256/e84da586a58cd9c84d0dffb89521b7f2e4c7217c9c933cc50f3bf106c3dc5e32>
      1 <hash://sha256/e2d5f45998f302cbff122e48eb300537568dba4edf3aef06aba0ed6e9196f634>
      1 <hash://sha256/d355f8d5fc32580a4929e9ed010c7160ae08ec0e92f7f9e62ded7ac607b90e85>
      1 <hash://sha256/d2f52720e8035824a1c86bca72fd64cfdb1d3e5956d92eac16ec725eb3c62ca1>
      1 <hash://sha256/d154aa55e318cf9db67110008232aa3e362d3585ef1ac267e2ad2c0cd85dddc3>
      1 <hash://sha256/7bf7ab6d0dde14ae492511ae53bc72eb4e14b6c98ef23f297edf9d7cd83c3557>
      1 <hash://sha256/7884b92bdc4f7f149850f7f43446ed9db4174e77d22b0e5298e51cb64e261aee>
      1 <hash://sha256/56b5d163e3cfe2041403bb8cb0a70b603db54fa52f1ce885593e9f27e7e1329b>
      1 <hash://sha256/288f711c26328387a1e9dd704ea1768780a8bd59b68d583083d229d8152a2aad>
      1 <hash://sha256/0c38a784574107da56a763faf9765fb08774f2af10ffbebb8cd9957e2ec04953>

jhpoelen commented 1 year ago

One of the excluded content ids belongs to a dataset whose meta.xml is:

$ preston cat 'zip:hash://sha256/bb2f55ec2c12472aece8e3cdf56471d6a0356d5d47929bf603ed94edf4104f34!/meta.xml'
<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files>
      <location>occurrence.txt</location>
    </files>
    <id index="0" />
    <field index="1" term="http://purl.org/dc/terms/type"/>
    <field index="2" term="http://purl.org/dc/terms/modified"/>
    <field index="3" term="http://purl.org/dc/terms/language"/>
    <field index="4" term="http://purl.org/dc/terms/license"/>
    <field index="5" term="http://purl.org/dc/terms/rightsHolder"/>
    <field index="6" term="http://purl.org/dc/terms/accessRights"/>
    <field index="7" term="http://purl.org/dc/terms/bibliographicCitation"/>
    <field index="8" term="http://purl.org/dc/terms/references"/>
    <field index="9" term="http://rs.tdwg.org/dwc/terms/institutionID"/>
    <field index="10" term="http://rs.tdwg.org/dwc/terms/collectionID"/>
    <field index="11" term="http://rs.tdwg.org/dwc/terms/datasetID"/>
    <field index="12" term="http://rs.tdwg.org/dwc/terms/institutionCode"/>
    <field index="13" term="http://rs.tdwg.org/dwc/terms/collectionCode"/>
    <field index="14" term="http://rs.tdwg.org/dwc/terms/datasetName"/>
    <field index="15" term="http://rs.tdwg.org/dwc/terms/ownerInstitutionCode"/>
    <field index="16" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
    <field index="17" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="18" term="http://rs.tdwg.org/dwc/terms/catalogNumber"/>
    <field index="19" term="http://rs.tdwg.org/dwc/terms/occurrenceRemarks"/>
    <field index="20" term="http://rs.tdwg.org/dwc/terms/recordNumber"/>
    <field index="21" term="http://rs.tdwg.org/dwc/terms/recordedBy"/>
    <field index="22" term="http://rs.tdwg.org/dwc/terms/individualCount"/>
    <field index="23" term="http://rs.tdwg.org/dwc/terms/sex"/>
    <field index="24" term="http://rs.tdwg.org/dwc/terms/lifeStage"/>
    <field index="25" term="http://rs.tdwg.org/dwc/terms/preparations"/>
    <field index="26" term="http://rs.tdwg.org/dwc/terms/otherCatalogNumbers"/>
    <field index="27" term="http://rs.tdwg.org/dwc/terms/associatedMedia"/>
    <field index="28" term="http://rs.tdwg.org/dwc/terms/associatedSequences"/>
    <field index="29" term="http://rs.tdwg.org/dwc/terms/organismID"/>
    <field index="30" term="http://rs.tdwg.org/dwc/terms/associatedOccurrences"/>
    <field index="31" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
    <field index="32" term="http://rs.tdwg.org/dwc/terms/eventTime"/>
    <field index="33" term="http://rs.tdwg.org/dwc/terms/startDayOfYear"/>
    <field index="34" term="http://rs.tdwg.org/dwc/terms/endDayOfYear"/>
    <field index="35" term="http://rs.tdwg.org/dwc/terms/year"/>
    <field index="36" term="http://rs.tdwg.org/dwc/terms/month"/>
    <field index="37" term="http://rs.tdwg.org/dwc/terms/day"/>
    <field index="38" term="http://rs.tdwg.org/dwc/terms/verbatimEventDate"/>
    <field index="39" term="http://rs.tdwg.org/dwc/terms/fieldNumber"/>
    <field index="40" term="http://rs.tdwg.org/dwc/terms/fieldNotes"/>
    <field index="41" term="http://rs.tdwg.org/dwc/terms/locationID"/>
    <field index="42" term="http://rs.tdwg.org/dwc/terms/higherGeography"/>
    <field index="43" term="http://rs.tdwg.org/dwc/terms/continent"/>
    <field index="44" term="http://rs.tdwg.org/dwc/terms/waterBody"/>
    <field index="45" term="http://rs.tdwg.org/dwc/terms/islandGroup"/>
    <field index="46" term="http://rs.tdwg.org/dwc/terms/island"/>
    <field index="47" term="http://rs.tdwg.org/dwc/terms/country"/>
    <field index="48" term="http://rs.tdwg.org/dwc/terms/stateProvince"/>
    <field index="49" term="http://rs.tdwg.org/dwc/terms/county"/>
    <field index="50" term="http://rs.tdwg.org/dwc/terms/locality"/>
    <field index="51" term="http://rs.tdwg.org/dwc/terms/verbatimElevation"/>
    <field index="52" term="http://rs.tdwg.org/dwc/terms/minimumElevationInMeters"/>
    <field index="53" term="http://rs.tdwg.org/dwc/terms/maximumElevationInMeters"/>
    <field index="54" term="http://rs.tdwg.org/dwc/terms/verbatimDepth"/>
    <field index="55" term="http://rs.tdwg.org/dwc/terms/minimumDepthInMeters"/>
    <field index="56" term="http://rs.tdwg.org/dwc/terms/maximumDepthInMeters"/>
    <field index="57" term="http://rs.tdwg.org/dwc/terms/locationRemarks"/>
    <field index="58" term="http://rs.tdwg.org/dwc/terms/verbatimLatitude"/>
    <field index="59" term="http://rs.tdwg.org/dwc/terms/verbatimLongitude"/>
    <field index="60" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="61" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
    <field index="62" term="http://rs.tdwg.org/dwc/terms/geodeticDatum"/>
    <field index="63" term="http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters"/>
    <field index="64" term="http://rs.tdwg.org/dwc/terms/coordinatePrecision"/>
    <field index="65" term="http://rs.tdwg.org/dwc/terms/georeferenceProtocol"/>
    <field index="66" term="http://rs.tdwg.org/dwc/terms/identifiedBy"/>
    <field index="67" term="http://rs.tdwg.org/dwc/terms/dateIdentified"/>
    <field index="68" term="http://rs.tdwg.org/dwc/terms/identificationQualifier"/>
    <field index="69" term="http://rs.tdwg.org/dwc/terms/typeStatus"/>
    <field index="70" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="71" term="http://rs.tdwg.org/dwc/terms/higherClassification"/>
    <field index="72" term="http://rs.tdwg.org/dwc/terms/kingdom"/>
    <field index="73" term="http://rs.tdwg.org/dwc/terms/phylum"/>
    <field index="74" term="http://rs.tdwg.org/dwc/terms/class"/>
    <field index="75" term="http://rs.tdwg.org/dwc/terms/order"/>
    <field index="76" term="http://rs.tdwg.org/dwc/terms/family"/>
    <field index="77" term="http://rs.tdwg.org/dwc/terms/genus"/>
    <field index="78" term="http://rs.tdwg.org/dwc/terms/subgenus"/>
    <field index="79" term="http://rs.tdwg.org/dwc/terms/specificEpithet"/>
    <field index="80" term="http://rs.tdwg.org/dwc/terms/infraspecificEpithet"/>
    <field index="81" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
    <field index="82" term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship"/>
    <field index="83" term="http://rs.tdwg.org/dwc/terms/nomenclaturalCode"/>
  </core>
  <extension encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/ac/terms/Multimedia">
    <files>
      <location>multimedia.txt</location>
    </files>
    <coreid index="0" />
    <field index="1" term="http://purl.org/dc/terms/identifier"/>
    <field index="2" term="http://purl.org/dc/elements/1.1/type"/>
    <field index="3" term="http://rs.tdwg.org/ac/terms/subtypeLiteral"/>
    <field index="4" term="http://purl.org/dc/terms/title"/>
    <field index="5" term="http://ns.adobe.com/xap/1.0/MetadataDate"/>
    <field index="6" term="http://rs.tdwg.org/ac/terms/metadataLanguageLiteral"/>
    <field index="7" term="http://rs.tdwg.org/ac/terms/providerManagedID"/>
    <field index="8" term="http://rs.tdwg.org/ac/terms/hasServiceAccessPoint"/>
    <field index="9" term="http://purl.org/dc/elements/1.1/rights"/>
    <field index="10" term="http://ns.adobe.com/xap/1.0/rights/Owner"/>
    <field index="11" term="http://ns.adobe.com/xap/1.0/rights/WebStatement"/>
    <field index="12" term="http://ns.adobe.com/photoshop/1.0/Credit"/>
    <field index="13" term="http://purl.org/dc/elements/1.1/creator"/>
    <field index="14" term="http://rs.tdwg.org/ac/terms/providerLiteral"/>
    <field index="15" term="http://purl.org/dc/terms/description"/>
    <field index="16" term="http://rs.tdwg.org/ac/terms/tag"/>
    <field index="17" term="http://ns.adobe.com/xap/1.0/CreateDate"/>
    <field index="18" term="http://rs.tdwg.org/ac/terms/IDofContainingCollection"/>
    <field index="19" term="http://rs.tdwg.org/ac/terms/accessURI"/>
    <field index="20" term="http://purl.org/dc/elements/1.1/format"/>
    <field index="21" term="http://rs.tdwg.org/ac/terms/variantLiteral"/>
    <field index="22" term="http://rs.tdwg.org/ac/terms/hashFunction"/>
    <field index="23" term="http://rs.tdwg.org/ac/terms/hashValue"/>
    <field index="24" term="http://ns.adobe.com/exif/1.0/PixelXDimension"/>
    <field index="25" term="http://ns.adobe.com/exif/1.0/PixelYDimension"/>
  </extension>
</archive>

Note that the multimedia extension uses http://purl.org/dc/elements/1.1/type instead of the expected http://purl.org/dc/terms/type .

For reference, https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#type defines the latter as:

URI http://purl.org/dc/terms/type
Label Type
Definition The nature or genre of the resource.
Comment Recommended practice is to use a controlled vocabulary such as the DCMI Type Vocabulary [DCMI-TYPE]. To describe the file format, physical medium, or dimensions of the resource, use the property Format.
Type of Term Property
Subproperty of Type (http://purl.org/dc/elements/1.1/type)
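
A minimal sketch of a wider net for meta.xml files like the one above, assuming Python rather than the shell used in has-multimedia.sh. It lists each multimedia extension together with the column index of its type field, accepting either Dublin Core URI. The function name and the set of accepted rowTypes are assumptions made for illustration, not the rules currently in the repository.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <archive> element of a Darwin Core Archive meta.xml.
DWCA_NS = "{http://rs.tdwg.org/dwc/text/}"

# Both URIs identify a Dublin Core "type" property; dcterms:type is a
# subproperty of the elements/1.1 one.
TYPE_TERMS = {
    "http://purl.org/dc/terms/type",
    "http://purl.org/dc/elements/1.1/type",
}

# rowTypes treated as multimedia extensions (an assumed, non-exhaustive list).
MULTIMEDIA_ROW_TYPES = {
    "http://rs.tdwg.org/ac/terms/Multimedia",
    "http://rs.gbif.org/terms/1.0/Multimedia",
}


def multimedia_extensions(meta_xml):
    """Return [(file location, type column index or None)] per multimedia extension."""
    root = ET.fromstring(meta_xml)
    found = []
    for ext in root.iter(DWCA_NS + "extension"):
        if ext.get("rowType") not in MULTIMEDIA_ROW_TYPES:
            continue
        location = ext.findtext(DWCA_NS + "files/" + DWCA_NS + "location")
        type_index = None
        for field in ext.iter(DWCA_NS + "field"):
            if field.get("term") in TYPE_TERMS:
                type_index = int(field.get("index"))
                break
        found.append((location, type_index))
    return found
```

An archive counts as "has multimedia" as soon as the returned list is non-empty; a type index of None means the extension is present but carries no type column under either URI.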

@qgroom @timrobertson100 Note that GBIF's parser automatically treats the two as the same term and keeps only dcterms:type:

WARN  [org.gbif.dwc.terms.TermFactory] - Terms dcterms:type and http://purl.org/dc/elements/1.1/type are both known as "type". Keeping only dcterms:type

https://github.com/gbif/dwc-api/blob/c79f1245ac15679e1526304934f58f7ac21158fe/src/main/java/org/gbif/dwc/terms/TermFactory.java#L160

A small redirect with a big impact. . .
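
To illustrate why the distinction disappears, here is a rough sketch of a registry that keys terms by their simple name, so that two URIs ending in "type" collapse into whichever was registered first. This is not GBIF's TermFactory code; the class and function names are hypothetical and only mimic the behaviour described by the warning above.

```python
def simple_name(term_uri):
    """Last path or fragment segment of a term URI, e.g. '.../terms/type' -> 'type'."""
    return term_uri.rstrip("/").replace("#", "/").rsplit("/", 1)[-1]


class TermRegistry:
    """Hypothetical registry that keeps one URI per simple name."""

    def __init__(self):
        self._by_name = {}

    def register(self, term_uri):
        # The first URI registered under a simple name wins; later URIs with
        # the same simple name are silently mapped onto it.
        return self._by_name.setdefault(simple_name(term_uri), term_uri)


registry = TermRegistry()
kept = registry.register("http://purl.org/dc/terms/type")
aliased = registry.register("http://purl.org/dc/elements/1.1/type")
```

After these two calls both `kept` and `aliased` hold the dcterms URI, so a script that greps meta.xml for the literal dcterms URI will disagree with a parser that works this way.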

jhpoelen commented 1 year ago

And

$ preston cat 'zip:hash://sha256/e8028a2f132734ff6d649a45f746b92fd8a958659cac120f0c66cf386a6b6b05!/multimedia.txt' | head -n2
id  identifier  MetadataDate    accessURI
d0487b32-22c9-4938-9355-2f00e61cd682    https://quod.lib.umich.edu/cgi/i/image/api/image/herb00ic:100000:MICH-F-100000/full/res:0/0/native.jpg  2022-05-16 07:11:56 https://quod.lib.umich.edu/cgi/i/image/api/image/herb00ic:100000:MICH-F-100000/full/res:0/0/native.jpg

This archive doesn't type its multimedia as StillImage: the multimedia.txt header has no type column at all.
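
A quick way to flag such files, sketched in Python under the assumption that the header row uses term simple names as in the output above (the authoritative column mapping lives in meta.xml, so this is only a rough check). The function name is made up for illustration.

```python
import csv
import io


def has_type_column(tsv_text):
    """Return True if the first (header) row of a tab-separated file names a 'type' column."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])
    return "type" in (name.strip() for name in header)
```

For the header shown above (id, identifier, MetadataDate, accessURI) this returns False, so any rule that requires a StillImage value in a type column would skip the file.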

jhpoelen commented 1 year ago

And some images are extracted from the delimited list in the associatedMedia property of the occurrence records. Again, they are not specified as StillImage.

$ preston cat 'zip:hash://sha256/b29010327ec2bdbc73b52d3d2fbf2c7682a157026976a899bb8fc26300ba7f76!/meta.xml' 
<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files>
      <location>occurrence.txt</location>
    </files>
    <id index="0" />
    <field index="1" term="http://rs.tdwg.org/dwc/terms/institutionCode"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/collectionCode"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/catalogNumber"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/occurrenceRemarks"/>
    <field index="7" delimitedBy="|" term="http://rs.tdwg.org/dwc/terms/recordedBy"/>
    <field index="8" delimitedBy="|" term="http://rs.tdwg.org/dwc/terms/preparations"/>
    <field index="9" delimitedBy="|" term="http://rs.tdwg.org/dwc/terms/associatedMedia"/>
    <field index="10" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
    <field index="11" term="http://rs.tdwg.org/dwc/terms/year"/>
    <field index="12" term="http://rs.tdwg.org/dwc/terms/month"/>
    <field index="13" term="http://rs.tdwg.org/dwc/terms/day"/>
    <field index="14" term="http://rs.tdwg.org/dwc/terms/fieldNumber"/>
    <field index="15" term="http://rs.tdwg.org/dwc/terms/eventRemarks"/>
    <field index="16" delimitedBy="|" term="http://rs.tdwg.org/dwc/terms/higherGeography"/>
    <field index="17" term="http://rs.tdwg.org/dwc/terms/continent"/>
    <field index="18" term="http://rs.tdwg.org/dwc/terms/country"/>
    <field index="19" term="http://rs.tdwg.org/dwc/terms/stateProvince"/>
    <field index="20" term="http://rs.tdwg.org/dwc/terms/county"/>
    <field index="21" term="http://rs.tdwg.org/dwc/terms/locality"/>
    <field index="22" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="23" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
    <field index="24" term="http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters"/>
    <field index="25" term="http://rs.tdwg.org/dwc/terms/georeferenceRemarks"/>
    <field index="26" delimitedBy="|" term="http://rs.tdwg.org/dwc/terms/identifiedBy"/>
    <field index="27" term="http://rs.tdwg.org/dwc/terms/dateIdentified"/>
    <field index="28" term="http://rs.tdwg.org/dwc/terms/identificationRemarks"/>
    <field index="29" delimitedBy="|" term="http://rs.tdwg.org/dwc/terms/typeStatus"/>
    <field index="30" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="31" term="http://rs.tdwg.org/dwc/terms/kingdom"/>
    <field index="32" term="http://rs.tdwg.org/dwc/terms/phylum"/>
    <field index="33" term="http://rs.tdwg.org/dwc/terms/class"/>
    <field index="34" term="http://rs.tdwg.org/dwc/terms/order"/>
    <field index="35" term="http://rs.tdwg.org/dwc/terms/family"/>
    <field index="36" term="http://rs.tdwg.org/dwc/terms/genus"/>
    <field index="37" term="http://rs.tdwg.org/dwc/terms/specificEpithet"/>
    <field index="38" term="http://rs.tdwg.org/dwc/terms/infraspecificEpithet"/>
    <field index="39" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
  </core>
</archive>

with example records:

preston cat 'zip:hash://sha256/b29010327ec2bdbc73b52d3d2fbf2c7682a157026976a899bb8fc26300ba7f76!/occurrence.txt' | grep http | head -n2
cb2216f9-94fa-43b6-b932-fffae1f013e6    TEX TEX PreservedSpecimen   cb2216f9-94fa-43b6-b932-fffae1f013e6    00000689        Charles S. Wallis   Sheet   http://was.tacc.utexas.edu/fileget?coll=TEX-LL&type=O&filename=sp64851890548675667169.att.jpg   1960-05-07  1960    5   7   8507        United States, Texas, Ochiltree North America   United States   Texas   Ochiltree   7.8 mi SE of Perryton on US 83.                 Turner, B. L.   2000            Nothocalais cuspidata   Plantae         Asterales   Asteraceae  Nothocalais cuspidata       Species
c45c57db-66c7-4003-9aa0-8e8af687765d    TEX TEX PreservedSpecimen   c45c57db-66c7-4003-9aa0-8e8af687765d    00000690        Charles S. Wallis   Sheet   http://was.tacc.utexas.edu/fileget?coll=TEX-LL&type=O&filename=sp63949271807247832370.att.jpg   1960-05-07  1960    5   7   8553    Prairie area.   United States, Texas, Roberts   North America   United States   Texas   Roberts 26 miles S of Perryton on Texas 70.                 Turner, B. L.   2000            Nothocalais cuspidata   Plantae         Asterales   Asteraceae  Nothocalais cuspidata       Species
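
A minimal sketch of pulling media URLs out of such records, assuming Python and assuming that the meta.xml core declares the associatedMedia column and, optionally, its delimitedBy attribute as in the listing above. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

DWCA_NS = "{http://rs.tdwg.org/dwc/text/}"
ASSOCIATED_MEDIA = "http://rs.tdwg.org/dwc/terms/associatedMedia"


def associated_media_spec(meta_xml):
    """Return (column index, delimiter or None) for associatedMedia in the core, else None."""
    core = ET.fromstring(meta_xml).find(DWCA_NS + "core")
    if core is None:
        return None
    for field in core.iter(DWCA_NS + "field"):
        if field.get("term") == ASSOCIATED_MEDIA:
            return int(field.get("index")), field.get("delimitedBy")
    return None


def media_urls(row, spec):
    """Split the associatedMedia cell of one tab-separated row into individual values."""
    index, delimiter = spec
    cells = row.rstrip("\n").split("\t")
    if index >= len(cells):
        return []
    value = cells[index]
    # Without a declared delimiter the whole cell is treated as one value.
    parts = value.split(delimiter) if delimiter else [value]
    return [part.strip() for part in parts if part.strip()]
```

Nothing here says what kind of media a URL points to; the values are returned untyped, which is exactly the gap noted above.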

jhpoelen commented 1 year ago

@timrobertson100 can you please point out where GBIF makes assumptions that certain media are of type StillImage even if no type information is provided?

timrobertson100 commented 1 year ago

Honestly, it's not a part of the codebase I know well since it was rewritten, but it will come from a variety of places depending on the format of the source, including (somewhere around here in the code):

From a quick look, each of those appears to generate ImageRecord instances, among other things, which are eventually normalised as MultimediaRecords with the type, somewhere around here.

I suspect your question is mainly about how GBIF parses the various input sources for images, and I think the answer lies in the list of classes above (or somewhere around that area of the code).

jhpoelen commented 1 year ago

@timrobertson100 thanks for elaborating on how GBIF infers the StillImage type for multimedia records.

Because some are not explicitly typed as StillImage, I would expect GBIF to mark the image type as "inferred", just as you do with other inferred or otherwise changed values that you encounter in dwc archives and friends (e.g., ABCD). Ideally, there'd also be a reference to why the type was inferred. Curious to hear your thoughts on that.
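
To make the suggestion concrete, here is a small Python sketch of a type value that carries an "inferred" flag and a reason. It is not how GBIF does it; the class, the function, and the file-extension heuristic are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

# File extensions taken to indicate a still image (an assumed, incomplete list).
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff")


@dataclass
class MediaType:
    value: Optional[str]          # e.g. "StillImage", or None when unknown
    inferred: bool                # True when the source did not declare the type
    reason: Optional[str] = None  # why the type was inferred


def resolve_type(declared, access_uri):
    """Keep a declared type as-is; otherwise try to infer StillImage from the URI."""
    if declared:
        return MediaType(declared, inferred=False)
    parsed = urlparse(access_uri)
    # Check the path and the query string, because some providers pass the
    # file name as a query parameter (e.g. ...fileget?...&filename=x.att.jpg).
    for part in (parsed.path.lower(), parsed.query.lower()):
        if part.endswith(IMAGE_SUFFIXES):
            return MediaType("StillImage", True, "image file extension in " + access_uri)
    return MediaType(None, False)
```

With something like this, a consumer can tell a publisher-declared StillImage from one that was guessed from the URL, and can see what the guess was based on.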