gleanerio / gleaner

Gleaner: JSON-LD and structured data on the web harvesting
https://gleaner.io
Apache License 2.0

Provide option to skip validation #15

Open fils opened 4 years ago

fils commented 4 years ago

The task is to scrape biomedical markup data. The data extends the schema.org schema, so we don't need the validation that Gleaner performs.

fils commented 4 years ago

There are a few validation locations.

One is in acquire: acquire.go starting at line 110.

This simply checks whether the JSON-LD is well formed. It has to be well formed to be usable. You can check your JSON-LD using https://json-ld.org/playground/. Paste some example JSON-LD in there to confirm that it is well formed.

This validation function (at line 177) actually checks two things: 1) is the JSON-LD well formed, such that it can be unmarshalled, and 2) can the JSON-LD produce proper RDF, i.e., are the IRIs formed by the JSON-LD valid in RDF.
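The two checks described above can be sketched roughly as follows. This is not Gleaner's code: it uses only the Go standard library, and the function names (`validate`, `checkIRIs`) are made up for illustration. As a simplifying assumption, it accepts only absolute IRIs and blank node identifiers in `@id`, whereas a real JSON-LD processor would also resolve relative and compact IRIs before converting to RDF.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
)

// checkIRIs walks a parsed JSON value and verifies that every "@id"
// is either a blank node ("_:...") or an absolute IRI.
// Assumption: relative and compact IRIs are rejected in this sketch.
func checkIRIs(v interface{}) error {
	switch t := v.(type) {
	case map[string]interface{}:
		for key, val := range t {
			if key != "@id" {
				if err := checkIRIs(val); err != nil {
					return err
				}
				continue
			}
			s, ok := val.(string)
			if !ok {
				return fmt.Errorf("@id is not a string: %v", val)
			}
			if strings.HasPrefix(s, "_:") {
				continue // blank node identifier
			}
			u, err := url.Parse(s)
			if err != nil || !u.IsAbs() {
				return fmt.Errorf("@id %q is not an absolute IRI", s)
			}
		}
	case []interface{}:
		for _, item := range t {
			if err := checkIRIs(item); err != nil {
				return err
			}
		}
	}
	return nil
}

// validate runs both checks: 1) the document unmarshals as JSON,
// 2) its @id values look usable as RDF identifiers.
func validate(doc []byte) error {
	var parsed interface{}
	if err := json.Unmarshal(doc, &parsed); err != nil {
		return fmt.Errorf("not well formed: %w", err)
	}
	return checkIRIs(parsed)
}

func main() {
	docs := []string{
		`{"@id": "https://example.org/sample/1", "name": "ok"}`,
		`{"@id": "not an iri"}`,
		`{"@id": `,
	}
	for _, d := range docs {
		fmt.Printf("%-55s -> %v\n", d, validate([]byte(d)))
	}
}
```

A document that fails the first check cannot be parsed at all; one that fails the second parses fine but would yield unusable identifiers once turned into triples.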

Can you send any error you are getting so that I can see if this is the issue? If it is, I am fine with putting a flag for this in the config file to make the validation optional. Though really, your JSON-LD should make it past this point to be usable by most JSON-LD tooling.
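For illustration, an optional flag of that kind might look something like the sketch below. The `Config` type and its `SkipValidation` field are hypothetical names, not an existing Gleaner option, and the validator here is just a stand-in that checks JSON syntax.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Config holds a hypothetical option; the field name is an assumption
// and does not correspond to a real Gleaner config key.
type Config struct {
	SkipValidation bool
}

var errInvalid = errors.New("document failed validation")

// wellFormed is a stand-in validator that only checks JSON syntax.
func wellFormed(doc []byte) error {
	if !json.Valid(doc) {
		return errInvalid
	}
	return nil
}

// accept runs the validator unless the config says to skip it.
func accept(cfg Config, doc []byte, validate func([]byte) error) error {
	if cfg.SkipValidation {
		return nil
	}
	return validate(doc)
}

func main() {
	bad := []byte(`{"@type": `)
	fmt.Println("validation on: ", accept(Config{}, bad, wellFormed))
	fmt.Println("validation off:", accept(Config{SkipValidation: true}, bad, wellFormed))
}
```

Keeping validation on by default means existing runs behave as before, and only sources that explicitly opt out bypass the check.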

fils commented 4 years ago

The use case URL is: https://www.ebi.ac.uk/biosamples/samples/SAMEA4088955

One key issue here is that Gleaner is looking for http://schema.org/Dataset and this JSON-LD doesn't have that. It uses "DataRecord", which is likely in the OBI or biosample namespace. However, the default context is set to schema.org and the @type is DataRecord, and there is no such type in schema.org.
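The type mismatch can be illustrated with a small sketch. This is not how Gleaner does it: a real JSON-LD processor would fetch and apply the remote context, whereas this sketch assumes `@context` is a plain string and simply uses it as a vocabulary prefix. The names `expandedTypes` and `isDataset` are made up for the example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// wantType is schema.org's Dataset type in its http form.
const wantType = "http://schema.org/Dataset"

// expandedTypes resolves the top-level @type values against @context.
// Assumption: @context is a single string used as a vocabulary prefix.
func expandedTypes(doc []byte) ([]string, error) {
	var node struct {
		Context interface{} `json:"@context"`
		Type    interface{} `json:"@type"`
	}
	if err := json.Unmarshal(doc, &node); err != nil {
		return nil, err
	}
	vocab, ok := node.Context.(string)
	if !ok {
		return nil, errors.New("sketch only handles a string @context")
	}
	// Normalize "https://schema.org" and "http://schema.org/" alike.
	vocab = strings.TrimSuffix(vocab, "/") + "/"
	vocab = strings.Replace(vocab, "https://", "http://", 1)

	var names []string
	switch t := node.Type.(type) {
	case string:
		names = []string{t}
	case []interface{}:
		for _, v := range t {
			if s, ok := v.(string); ok {
				names = append(names, s)
			}
		}
	}
	out := make([]string, 0, len(names))
	for _, n := range names {
		if strings.Contains(n, ":") {
			out = append(out, n) // already an IRI or prefixed name
		} else {
			out = append(out, vocab+n)
		}
	}
	return out, nil
}

// isDataset reports whether any expanded type equals wantType.
func isDataset(doc []byte) bool {
	types, err := expandedTypes(doc)
	if err != nil {
		return false
	}
	for _, t := range types {
		if t == wantType {
			return true
		}
	}
	return false
}

func main() {
	record := []byte(`{"@context": "http://schema.org", "@type": "DataRecord"}`)
	dataset := []byte(`{"@context": "https://schema.org/", "@type": "Dataset"}`)
	fmt.Println("DataRecord matches:", isDataset(record))
	fmt.Println("Dataset matches:   ", isDataset(dataset))
}
```

Under these assumptions, "DataRecord" expands to http://schema.org/DataRecord, which is neither the type a harvester filtering on Dataset would pick up nor a type schema.org defines.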

The structured data testing tool URL for this page is below. Note that this error also means these records will not show up in Google Dataset Search.

We have some guidance on type dataset at: https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md

https://search.google.com/structured-data/testing-tool#url=https%3A%2F%2Fwww.ebi.ac.uk%2Fbiosamples%2Fsamples%2FSAMEA4088955

petrospaps commented 4 years ago

The biosample markup extends the schema from schema.org (not part of the current specification). More information can be found at: https://bioschemas.org

In an effort to scrape https://www.ebi.ac.uk/biosamples/sitemap/599, Gleaner threw the error:

ebibiosamples 2m45s [--------------------------------------------------------------------] 100%
panic: runtime error: index out of range [0] with length 0

goroutine 41 [running]:
earthcube.org/Project418/gleaner/vendor/github.com/gosuri/uiprogress.(*Bar).Bytes(0xc00087c300, 0xc000066f60, 0xc00054bf10, 0x1)
    /home/fils/src/go/src/earthcube.org/Project418/gleaner/vendor/github.com/gosuri/uiprogress/bar.go:195 +0x51f
earthcube.org/Project418/gleaner/vendor/github.com/gosuri/uiprogress.(*Bar).String(...)
    /home/fils/src/go/src/earthcube.org/Project418/gleaner/vendor/github.com/gosuri/uiprogress/bar.go:214
earthcube.org/Project418/gleaner/vendor/github.com/gosuri/uiprogress.(*Progress).print(0xc000066fc0)
    /home/fils/src/go/src/earthcube.org/Project418/gleaner/vendor/github.com/gosuri/uiprogress/progress.go:127 +0xa5
earthcube.org/Project418/gleaner/vendor/github.com/gosuri/uiprogress.(*Progress).Listen(0xc000066fc0)
    /home/fils/src/go/src/earthcube.org/Project418/gleaner/vendor/github.com/gosuri/uiprogress/progress.go:114 +0x49
created by earthcube.org/Project418/gleaner/vendor/github.com/gosuri/uiprogress.(*Progress).Start
    /home/fils/src/go/src/earthcube.org/Project418/gleaner/vendor/github.com/gosuri/uiprogress/progress.go:134 +0x46

Is this related to the JSON-LD validation, or is it a different problem?

The run configuration is:

gleaner
  summon: true
  mill: true

millers
  graph: true
  shacl: false

The OS is Windows 10.

Thank you