iobis / obis-issues

Repository for all OBIS related issues and feature requests

Incorrect character sequences and invalid characters #189

Open BobSimons opened 3 years ago

BobSimons commented 3 years ago

In the 2021-05-18 occurrence.csv file, there are many character sequences that are probably incorrect, as well as invalid characters. My guess is that these stem from the characterSet being set or handled incorrectly when the data was imported from the source. Here are 3 examples from rows where scientificName="Ablennes hians":

  1. "IRD, UMR EME, Sète, France"
  2. "urn:catalog:IRD, UMR EME, Sète, France:DCR (IRD):#1314959243812#0.29551833381597215:0x0517180001000500"
  3. �les salomon

The data for many other scientificNames has similar problems. Can you please track down and fix these data problems? Thank you.
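One way to test the guess above is to check whether a garbled value turns back into sensible text when the presumed mistake is reversed. The following is a minimal sketch, assuming the values were UTF-8 bytes that got decoded as Latin-1 at some point; that assumption is not confirmed for this dataset, and `try_repair` is a hypothetical helper name, not part of any OBIS tooling.

```python
def try_repair(text: str) -> str:
    """Reverse a suspected 'UTF-8 bytes decoded as Latin-1' mistake.

    Returns the input unchanged if the round trip is not possible,
    which is the case for text that was never garbled this way.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the text has characters outside Latin-1, or the
        # resulting bytes are not valid UTF-8: leave it alone.
        return text


# "S\u00c3\u00a8te" is the garbled spelling from example 1 (S, U+00C3, U+00A8, t, e).
garbled = "S\u00c3\u00a8te"
print(try_repair(garbled))
```

Values that contain the replacement character U+FFFD, as in example 3, come back unchanged: the original byte was discarded when the replacement character was written, so it cannot be recovered from the file alone.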

pieterprovoost commented 3 years ago

Thanks. Just noting that the encoding issues listed here are present in the UTF-8 encoded data files on IPT, so they will need to be fixed at the data provider or node level. The respective nodes have been notified and I'll try to come up with a report to help the node managers identify issues.
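As a rough illustration of what such a report could look for, the sketch below scans a CSV and flags two kinds of suspect values: those containing the replacement character U+FFFD, and those that change when a Latin-1 to UTF-8 round trip is applied. This is only a sketch under stated assumptions, not the report described above: `classify`, `find_suspect_values`, the heuristics, and the sample columns are all illustrative.

```python
import csv
import io

REPLACEMENT_CHAR = "\ufffd"


def classify(value: str):
    """Return a short reason if the value looks mis-encoded, else None."""
    if REPLACEMENT_CHAR in value:
        return "replacement character"
    try:
        repaired = value.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return None  # round trip not possible, so not this kind of mojibake
    return "possible mojibake" if repaired != value else None


def find_suspect_values(csv_file):
    """Return (line number, column, reason) for each suspect field."""
    reader = csv.DictReader(csv_file)
    findings = []
    for row in reader:
        for column, value in row.items():
            if not isinstance(value, str):
                continue  # missing or surplus fields in a malformed row
            reason = classify(value)
            if reason:
                findings.append((reader.line_num, column, reason))
    return findings


# Illustrative sample built from the examples in this issue; the column
# assignment is an assumption made for the demo.
sample = io.StringIO(
    "scientificName,institutionCode,locality\n"
    "Ablennes hians,\"IRD, UMR EME, S\u00c3\u00a8te, France\",\ufffdles salomon\n"
    "Ablennes hians,IRD,S\u00e8te\n"
)
findings = find_suspect_values(sample)
for line_num, column, reason in findings:
    print(line_num, column, reason)
```

The round-trip check is a heuristic and can produce false positives on legitimate text that happens to form valid UTF-8 byte sequences, so flagged values would still need review by the node manager.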