gleanerio / gleaner

Gleaner: JSON-LD and structured data on the web harvesting
https://gleaner.io
Apache License 2.0

Issue with Sitemaps #98

Closed valentinedwv closed 1 year ago

valentinedwv commented 2 years ago

Is something borked with the latest? Only headless sources were harvested.

Ran glcon and got a bunch of these in the logs, with no summoned files:

ubuntu@geocodes-dev:~/indexing$ grep unavco gleaner-2022-07-28-17-34-18.log
{"file":"/github/workspace/internal/summoner/acquire/resources.go:189","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.getRobotsForDomain","level":"info","msg":"Getting robots.txt from http://www.unavco.org//robots.txt","time":"2022-07-28T17:37:31Z"}
{"file":"/github/workspace/internal/summoner/acquire/resources.go:127","func":"github.com/gleanerio/gleaner/internal/summoner/acquire.getSitemapURLList","level":"info","msg":"https://www.unavco.org/data/doi/sitemap.xml is not a sitemap index, checking to see if it is a sitemap","time":"2022-07-28T17:37:32Z"}
{"file":"/github/workspace/internal/millers/millers.go:44","func":"github.com/gleanerio/gleaner/internal/millers.Millers","level":"info","msg":"Adding bucket to milling list:summoned/unavco","time":"2022-07-30T20:09:35Z"}
{"file":"/github/workspace/internal/millers/millers.go:55","func":"github.com/gleanerio/gleaner/internal/millers.Millers","level":"info","msg":"Adding bucket to prov building list:prov/unavco","time":"2022-07-30T20:09:35Z"}
{"file":"/github/workspace/internal/millers/graph/graphng.go:82","func":"github.com/gleanerio/gleaner/internal/millers/graph.GraphNG","level":"info","msg":"Assembling result graph for prefix:summoned/unavcoto:milled/unavco","time":"2022-07-31T00:20:32Z"}
{"file":"/github/workspace/internal/millers/graph/graphng.go:83","func":"github.com/gleanerio/gleaner/internal/millers/graph.GraphNG","level":"info","msg":"Result graph will be at:results/runX/unavco_graph.nq","time":"2022-07-31T00:20:32Z"}
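
For anyone reading logs like the ones above, here is a minimal sketch of pulling out every record that mentions one source and listing which function emitted it. It assumes only the field names visible in the excerpt (`func`, `msg`); the helper name `messages_for_source` and the shortened sample records are made up for illustration.

```python
import json


def messages_for_source(log_lines, source):
    """Return (short_func_name, msg) pairs for JSON log lines mentioning `source`."""
    hits = []
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip shell prompts and other non-JSON lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed records
        msg = record.get("msg", "")
        if source in msg:
            # keep only the function name after the last dot of the package path
            func = record.get("func", "").rsplit(".", 1)[-1]
            hits.append((func, msg))
    return hits


# Shortened sample records modelled on the excerpt above (illustrative only).
sample = [
    '{"func":"github.com/gleanerio/gleaner/internal/summoner/acquire.getRobotsForDomain",'
    '"level":"info","msg":"Getting robots.txt from http://www.unavco.org//robots.txt"}',
    '{"func":"github.com/gleanerio/gleaner/internal/millers.Millers",'
    '"level":"info","msg":"Adding bucket to milling list:summoned/unavco"}',
    "ubuntu@geocodes-dev:~/indexing$ not a log record",
]

hits = messages_for_source(sample, "unavco")
for func, msg in hits:
    print(func, "|", msg)
```

If the only summoner-side functions in the output are the robots.txt and sitemap checks, and everything after that comes from the millers, that matches the symptom reported here: the source was visited but nothing was summoned.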
valentinedwv commented 2 years ago

It works when no sources are headless:

context:
  cache: true
contextmaps:
- file: ./configs/schemaorg-current-https.jsonld
  prefix: https://schema.org/
- file: ./configs/schemaorg-current-https.jsonld
  prefix: http://schema.org/
gleaner:
  mill: true
  runid: runX
  summon: true
millers:
  graph: true
minio:
  address: oss.geocodes-dev.earthcube.org
  port: 443
  ssl: true
  accesskey: worldsbestaccesskey
  secretkey: worldsbestsecretkey
  bucket: citesting
sources:
- sourcetype: sitemap
  name: geocodes_demo_datasets
  logo: https://github.com/earthcube/GeoCODES-Metadata/metadata/OtherResources
  url: https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/gh-pages/metadata/Dataset/sitemap.xml
  headless: false
  pid: https://www.earthcube.org/datasets/
  propername: Geocodes Demo Datasets
  domain: "0"
  active: true
  credentialsfile: ""
  other: {}
  headlesswait: 0
  delay: 0
summoner:
  after: ""
  delay: null
  headless: http://127.0.0.1:9222
  mode: full
  threads: 5

But with a headless source there are issues:

context:
  cache: true
contextmaps:
- file: ./configs/schemaorg-current-https.jsonld
  prefix: https://schema.org/
- file: ./configs/schemaorg-current-https.jsonld
  prefix: http://schema.org/
gleaner:
  mill: true
  runid: runX
  summon: true
millers:
  graph: true
minio:
  address: oss.geocodes-dev.earthcube.org
  port: 443
  ssl: true
  accesskey: worldsbestaccesskey
  secretkey: worldsbestsecretkey
  bucket: test3
sources:
- sourcetype: sitemap
  name: opentopography
  logo: https://opentopography.org/sites/opentopography.org/files/ot_transp_logo_2.png
  url: https://opentopography.org/sitemap.xml
  headless: false
  pid: https://www.re3data.org/repository/r3d100010655
  propername: OpenTopography
  domain: http://www.opentopography.org/
  active: true
  credentialsfile: ""
  other: {}
  headlesswait: 0
  delay: 0
- sourcetype: sitemap
  name: magic
  logo: http://mbobak-ofc.ncsa.illinois.edu/ext/ec/magic/MagIC.png
  url: https://www2.earthref.org/MagIC/contributions.sitemap.xml
  headless: true
  pid: http://www.re3data.org/repository/r3d100011910
  propername: Magnetics Information Consortium (MagIC)
  domain: https://www.earthref.org/MagIC
  active: true
  credentialsfile: ""
  other: {}
  headlesswait: 0
  delay: 0
- sourcetype: sitemap
  name: earthchem
  logo: http://www.earthchem.org/sites/default/files/files/EC_0-1.png
  url: https://ecl.earthchem.org/sitemap.xml
  headless: false
  pid: https://www.re3data.org/repository/r3d100011538
  propername: earthchem
  domain: https://ecl.earthchem.org/home.php
  active: false
  credentialsfile: ""
  other: {}
  headlesswait: 0
  delay: 0
- sourcetype: sitemap
  name: geocodes_demo_datasets
  logo: ""
  url: https://raw.githubusercontent.com/earthcube/GeoCODES-Metadata/gh-pages/metadata/Dataset/sitemap.xml
  headless: false
  pid: https://github.com/earthcube/GeoCODES-Metadata/metadata/OtherResources
  propername: Geocodes Demo Datasets
  domain: https://www.earthcube.org/datasets/
  active: true
  credentialsfile: ""
  other: {}
  headlesswait: 0
  delay: 0
summoner:
  after: ""
  delay: null
  headless: http://127.0.0.1:9222
  mode: full
  threads: 5
nein09 commented 2 years ago

Hi, sorry it has taken me so long to look at this - I was on vacation, and I've also been ill. With your first config that you posted there, if you try indexing that particular source with headless: true, do you see the same issue?

valentinedwv commented 2 years ago

If all sources have headless: false, it works. If one source has headless: true, there may be an issue. I have not tried a config with just a couple of headless: true sources.

nein09 commented 2 years ago

I'm not very familiar with glcon, so I tried running command-line gleaner with your second config there, and I can reproduce the problem. It seems to happen with headless: false data sources that appear after headless: true ones in a config, which is an interesting clue.
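
To check a config against that clue, a small sketch like the following lists the non-headless sources that come after a headless one. It works on an already-parsed `sources` list (plain dictionaries with the same keys as the YAML above); the function name is hypothetical, and skipping inactive sources is an assumption about how the summoner treats them.

```python
def non_headless_after_headless(sources):
    """Return names of active headless: false sources listed after a headless: true one."""
    seen_headless = False
    suspects = []
    for src in sources:
        if not src.get("active", True):
            continue  # assumption: inactive sources are never summoned
        if src.get("headless", False):
            seen_headless = True
        elif seen_headless:
            suspects.append(src["name"])
    return suspects


# The four sources from the second config in this thread, reduced to the relevant keys.
sources = [
    {"name": "opentopography", "headless": False, "active": True},
    {"name": "magic", "headless": True, "active": True},
    {"name": "earthchem", "headless": False, "active": False},
    {"name": "geocodes_demo_datasets", "headless": False, "active": True},
]

suspects = non_headless_after_headless(sources)
print(suspects)
```

Any name in the result is a source whose summoned bucket is worth inspecting first.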

nein09 commented 2 years ago

I'm able to reproduce this with just one headless data source in my config. What seems to be happening is that the JSON-LD is fetched correctly (it has the correct size, and dumping it to the console produces the correct output), it is processed correctly if any fixups need to occur, and when Minio goes to write it to the bucket... it somehow doesn't get written, even though PutObject doesn't return an error.
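
One way to catch a write that reports success but leaves nothing behind is to read the object's size back right after the put. The sketch below shows the idea against two stand-in stores. The `put` and `size` methods are hypothetical and are not the MinIO client API; with a real client the read-back would be whatever stat or head-object call that client offers.

```python
class WorkingStore:
    """Stand-in object store that keeps what it is given."""

    def __init__(self):
        self.objects = {}

    def put(self, bucket, key, data):
        self.objects[(bucket, key)] = data

    def size(self, bucket, key):
        data = self.objects.get((bucket, key))
        return None if data is None else len(data)


class SilentlyDroppingStore(WorkingStore):
    """Stand-in store that accepts a write without error but stores nothing."""

    def put(self, bucket, key, data):
        return None  # no exception, no stored object


def put_and_verify(store, bucket, key, data):
    """Write `data`, then confirm an object of the same size is really there."""
    store.put(bucket, key, data)
    return store.size(bucket, key) == len(data)


payload = b'{"@context": "https://schema.org/", "@type": "Dataset"}'
ok_result = put_and_verify(WorkingStore(), "gleaner", "summoned/cchdo/a.jsonld", payload)
dropped_result = put_and_verify(SilentlyDroppingStore(), "gleaner", "summoned/cchdo/a.jsonld", payload)
print(ok_result, dropped_result)
```

A check like this turns the silent case into an explicit failure that can be logged next to the object key.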

Here's the config yaml that I'm using. It's pretty minimal and straightforward, and I don't understand why the headless path works fine when it even uses the same Upload helper method.

Could something weird be going on with the way threading is done differently between the two, or something like that?

@fils have you ever seen anything like this before?

valentinedwv commented 2 years ago

Thanks, that should give us a place to look.

glcon just wraps the calls to gleaner, nabu, and the configuration tools in one command-line package. Rather than running gleaner, you run glcon gleaner, which adds a few features.

nein09 commented 2 years ago

Update: I commented out the waitgroups and semaphores in acquire.go (GetDomain and ResRetrieve) to make this single-threaded, and... there is still nothing in my summoned directory. So we can rule out threading, I think.

nein09 commented 2 years ago

I just realized I forgot to post my config yaml upthread. Here it is if anyone is interested.

minio:
  address: 0.0.0.0
  port: 9000   
  ssl: false
  bucket: gleaner
  accessKey: worldsbestaccesskey
  secretKey: worldsbestsecretkey
gleaner:
  runid: polder # Run ID used in prov and a few others
  summon: true # do we want to visit the web sites and pull down the files
  mill: true
context:
  cache: true
contextmaps:
- prefix: "https://schema.org/"
  file: "./schemaorg-current-https.jsonld"
- prefix: "http://schema.org/"
  file: "./schemaorg-current-https.jsonld"
- prefix: "http://schema.org/"
  file: "./schemaorg-current-https.jsonld"
summoner:
  after: ""      # "21 May 20 10:00 UTC"   
  mode: full  # full || diff:  If diff compare what we have currently in gleaner to sitemap, get only new, delete missing
  threads: 5
  delay: 0 # milliseconds (1000 = 1 second) to delay between calls (will FORCE threads to 1)
  headless:  http://127.0.0.1:9222  # URL for headless see docs/headless
millers:
  graph: true
sources:
# - name: gem
#   sourcetype: sitemap
#   headless: true
#   url: https://data.g-e-m.dk/sitemap
#   properName: Greenland Ecosystem Monitoring Database
#   domain: https://data.g-e-m.dk
#   active: false
- name : CCHDO
  url: https://cchdo.ucsd.edu/sitemap.xml
  sourcetype: sitemap
  headless: false
  properName: CLIVAR and Carbon Hydrographic Data Office
  domain: https://cchdo.ucsd.edu
  active: true
nein09 commented 2 years ago

I think I may have a fix for this. I'm testing it with @valentinedwv's config above as well as mine. Stay tuned!

nein09 commented 2 years ago

Opened https://github.com/gleanerio/gleaner/pull/101 with a fix.

valentinedwv commented 1 year ago

118 #119 is another issue.

valentinedwv commented 1 year ago

A bad URL means that the repos listed after it do not get added to the URL list.
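
As an illustration of the behaviour one would want instead, here is a minimal sketch of per-source error isolation: a failure on one source is recorded and the loop moves on to the next. This is not Gleaner's code. `collect_urls`, `fake_fetch`, and the example URLs are all hypothetical.

```python
def collect_urls(sources, fetch_sitemap):
    """Build a URL list per source, recording failures instead of stopping."""
    urls = {}
    failures = {}
    for src in sources:
        name = src["name"]
        try:
            urls[name] = fetch_sitemap(src["url"])
        except Exception as exc:
            # one bad source must not prevent later sources from being processed
            failures[name] = str(exc)
    return urls, failures


def fake_fetch(url):
    """Stand-in fetcher: rejects anything that is not an https URL."""
    if not url.startswith("https://"):
        raise ValueError(f"bad sitemap URL: {url}")
    return [url.rsplit("/", 1)[0] + "/page1"]


sources = [
    {"name": "first", "url": "https://example.org/sitemap.xml"},
    {"name": "broken", "url": "not-a-url"},
    {"name": "last", "url": "https://example.com/sitemap.xml"},
]

urls, failures = collect_urls(sources, fake_fetch)
print(sorted(urls), failures)
```

Here the source listed after the broken one still ends up in the URL list, and the failure stays visible for later reporting.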

valentinedwv commented 1 year ago

119