Closed valentinedwv closed 1 year ago
It works when no source uses headless:
context:
  cache: true
contextmaps:
- file: ./configs/schemaorg-current-https.jsonld
  prefix: https://schema.org/
- file: ./configs/schemaorg-current-https.jsonld
  prefix: http://schema.org/
gleaner:
  mill: true
  runid: runX
  summon: true
millers:
  graph: true
minio:
  address: oss.geocodes-dev.earthcube.org
  port: 443
  ssl: true
  accesskey: worldsbestaccesskey
  secretkey: worldsbestsecretkey
  bucket: citesting
sources:
- sourcetype: sitemap
  name: geocodes_demo_datasets
  logo: https://github.com/earthcube/GeoCODES-Metadata/metadata/OtherResources
  url: https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/gh-pages/metadata/Dataset/sitemap.xml
  headless: false
  pid: https://www.earthcube.org/datasets/
  propername: Geocodes Demo Datasets
  domain: "0"
  active: true
  credentialsfile: ""
  other: {}
  headlesswait: 0
  delay: 0
summoner:
  after: ""
  delay: null
  headless: http://127.0.0.1:9222
  mode: full
  threads: 5
But with a headless source, there are issues:
context:
  cache: true
contextmaps:
- file: ./configs/schemaorg-current-https.jsonld
  prefix: https://schema.org/
- file: ./configs/schemaorg-current-https.jsonld
  prefix: http://schema.org/
gleaner:
  mill: true
  runid: runX
  summon: true
millers:
  graph: true
minio:
  address: oss.geocodes-dev.earthcube.org
  port: 443
  ssl: true
  accesskey: worldsbestaccesskey
  secretkey: worldsbestsecretkey
  bucket: test3
sources:
- sourcetype: sitemap
  name: opentopography
  logo: https://opentopography.org/sites/opentopography.org/files/ot_transp_logo_2.png
  url: https://opentopography.org/sitemap.xml
  headless: false
  pid: https://www.re3data.org/repository/r3d100010655
  propername: OpenTopography
  domain: http://www.opentopography.org/
  active: true
  credentialsfile: ""
  other: {}
  headlesswait: 0
  delay: 0
- sourcetype: sitemap
  name: magic
  logo: http://mbobak-ofc.ncsa.illinois.edu/ext/ec/magic/MagIC.png
  url: https://www2.earthref.org/MagIC/contributions.sitemap.xml
  headless: true
  pid: http://www.re3data.org/repository/r3d100011910
  propername: Magnetics Information Consortium (MagIC)
  domain: https://www.earthref.org/MagIC
  active: true
  credentialsfile: ""
  other: {}
  headlesswait: 0
  delay: 0
- sourcetype: sitemap
  name: earthchem
  logo: http://www.earthchem.org/sites/default/files/files/EC_0-1.png
  url: https://ecl.earthchem.org/sitemap.xml
  headless: false
  pid: https://www.re3data.org/repository/r3d100011538
  propername: earthchem
  domain: https://ecl.earthchem.org/home.php
  active: false
  credentialsfile: ""
  other: {}
  headlesswait: 0
  delay: 0
- sourcetype: sitemap
  name: geocodes_demo_datasets
  logo: ""
  url: https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/gh-pages/metadata/Dataset/sitemap.xml
  headless: false
  pid: https://github.com/earthcube/GeoCODES-Metadata/metadata/OtherResources
  propername: Geocodes Demo Datasets
  domain: https://www.earthcube.org/datasets/
  active: true
  credentialsfile: ""
  other: {}
  headlesswait: 0
  delay: 0
summoner:
  after: ""
  delay: null
  headless: http://127.0.0.1:9222
  mode: full
  threads: 5
Hi, sorry it has taken me so long to look at this - I was on vacation, and I've also been ill. With the first config you posted, if you try indexing that particular source with headless: true, do you see the same issue?
If all sources have headless: false, it works. If one source has headless: true, there may be an issue. I have not tried a config with just a couple of sources set to headless: true.
I'm not very familiar with glcon, so I tried running command-line gleaner with your second config there, and I can reproduce the problem. It seems to happen with headless: false data sources that appear after headless: true ones in a config, which is an interesting clue.
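To make the ordering clue concrete, here is a small diagnostic sketch that lists the headless: false sources appearing after a headless: true one. The function name and the decision to ignore inactive sources are my assumptions, not anything taken from gleaner; the source list just mirrors the relevant keys of the second config in this thread.

```python
from typing import Dict, List


def non_headless_after_headless(sources: List[Dict]) -> List[str]:
    """Return names of active headless: false sources listed after an
    active headless: true source (hypothetical helper, not gleaner code)."""
    seen_headless = False
    flagged = []
    for src in sources:
        # Assumption: inactive sources are not harvested, so skip them.
        if not src.get("active", True):
            continue
        if src.get("headless", False):
            seen_headless = True
        elif seen_headless:
            flagged.append(src["name"])
    return flagged


# Only the keys that matter here, in the same order as the second config.
sources = [
    {"name": "opentopography", "headless": False, "active": True},
    {"name": "magic", "headless": True, "active": True},
    {"name": "earthchem", "headless": False, "active": False},
    {"name": "geocodes_demo_datasets", "headless": False, "active": True},
]

print(non_headless_after_headless(sources))  # ['geocodes_demo_datasets']
```

If the ordering theory holds, the names printed here would be the sources to check first in the summoned output.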
I'm able to reproduce this with just one headless data source in my config. What seems to be happening is that the json-ld is fetched correctly (it has the correct size and dumping it to the console produces the correct output), it is processed correctly if any fixups need to occur, and when Minio goes to write it to the bucket... it somehow doesn't get written, even though PutObject doesn't return an error.
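One way to catch a write that reports success but leaves nothing behind is to read the object's size back right after the put. The sketch below uses a made-up in-memory store with hypothetical put and size methods; it is not the MinIO client API and not gleaner's Upload helper, only an illustration of the check.

```python
from typing import Dict, Optional


class InMemoryStore:
    """Hypothetical stand-in for an object store. With drop_writes=True it
    silently discards data, imitating a put that 'succeeds' but stores nothing."""

    def __init__(self, drop_writes: bool = False) -> None:
        self._objects: Dict[str, bytes] = {}
        self._drop_writes = drop_writes

    def put(self, key: str, data: bytes) -> None:
        if not self._drop_writes:
            self._objects[key] = data

    def size(self, key: str) -> Optional[int]:
        data = self._objects.get(key)
        return None if data is None else len(data)


def put_and_verify(store: InMemoryStore, key: str, data: bytes) -> bool:
    """Write the object, then confirm it exists with the expected size."""
    store.put(key, data)
    return store.size(key) == len(data)


doc = b'{"@type": "Dataset"}'
print(put_and_verify(InMemoryStore(), "summoned/example.jsonld", doc))
print(put_and_verify(InMemoryStore(drop_writes=True), "summoned/example.jsonld", doc))
```

Against a real bucket the same idea would be a put followed by a stat or list call on the key, logging a warning when the two disagree.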
Here's the config yaml that I'm using. It's pretty minimal and straightforward, and I don't understand why the headless path works fine when it uses the same Upload helper method.
Could something weird be going on with the way threading is done differently between the two, or something like that?
@fils have you ever seen anything like this before?
Thanks, that should give us a place to look.
glcon just wraps up the calls to gleaner, nabu and configuration tools in one command line package. Rather than running
gleaner
you run
glcon gleaner
with a few additional features.
Update: I commented out the waitgroups and semaphores in acquire.go (GetDomain and ResRetrieve) to make this single-threaded, and... there is still nothing in my summoned directory. So we can rule out threading, I think.
I just realized I forgot to post my config yaml upthread. Here it is if anyone is interested.
minio:
  address: 0.0.0.0
  port: 9000
  ssl: false
  bucket: gleaner
  accessKey: worldsbestaccesskey
  secretKey: worldsbestsecretkey
gleaner:
  runid: polder # Run ID used in prov and a few others
  summon: true # do we want to visit the web sites and pull down the files
  mill: true
context:
  cache: true
contextmaps:
- prefix: "https://schema.org/"
  file: "./schemaorg-current-https.jsonld"
- prefix: "http://schema.org/"
  file: "./schemaorg-current-https.jsonld"
- prefix: "http://schema.org/"
  file: "./schemaorg-current-https.jsonld"
summoner:
  after: "" # "21 May 20 10:00 UTC"
  mode: full # full || diff: If diff compare what we have currently in gleaner to sitemap, get only new, delete missing
  threads: 5
  delay: 0 # milliseconds (1000 = 1 second) to delay between calls (will FORCE threads to 1)
  headless: http://127.0.0.1:9222 # URL for headless see docs/headless
millers:
  graph: true
sources:
# - name: gem
#   sourcetype: sitemap
#   headless: true
#   url: https://data.g-e-m.dk/sitemap
#   properName: Greenland Ecosystem Monitoring Database
#   domain: https://data.g-e-m.dk
#   active: false
- name: CCHDO
  url: https://cchdo.ucsd.edu/sitemap.xml
  sourcetype: sitemap
  headless: false
  properName: CLIVAR and Carbon Hydrographic Data Office
  domain: https://cchdo.ucsd.edu
  active: true
I think I may have a fix for this - I'm testing it with @valentinedwv's config above as well as mine. Stay tuned!
Opened https://github.com/gleanerio/gleaner/pull/101 with a fix.
A bad URL means that the repos after it do not get added to the URL list.
Is something borked with the latest? Only headless sources were harvested.
Ran glcon and got a bunch of these in the logs, with no summoned files.
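For anyone following along, this is the general shape of that kind of bug: a loop that stops at the first bad entry loses everything listed after it, while a loop that skips and records the bad entry keeps going. The sketch below is illustrative only, with function names I made up and a simple validity check as a stand-in; it is not the code from gleaner or from the PR.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Minimal check: an http(s) scheme and a host must be present."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def collect_abort_on_error(urls: List[str]) -> List[str]:
    """Stops at the first bad URL, so later entries are never added."""
    collected = []
    for url in urls:
        if not is_valid_url(url):
            break
        collected.append(url)
    return collected


def collect_skip_bad(urls: List[str]) -> Tuple[List[str], List[str]]:
    """Skips bad URLs, records them, and keeps processing the rest."""
    collected, skipped = [], []
    for url in urls:
        if not is_valid_url(url):
            skipped.append(url)
            continue
        collected.append(url)
    return collected, skipped


urls = [
    "https://opentopography.org/sitemap.xml",
    "not a url",
    "https://cchdo.ucsd.edu/sitemap.xml",
]

print(collect_abort_on_error(urls))
print(collect_skip_bad(urls))
```

Returning the skipped entries alongside the good ones also gives something to log, so a bad source shows up in the output instead of quietly shrinking the list.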