rseng / rse

tools for assessment and categorization of research software
https://rseng.github.io/rse/
Mozilla Public License 2.0

`ERROR:rse.utils.urls:Cannot find endpoint` when running `rse import --type google-sheets` #76

Closed: NickleDave closed this issue 2 years ago

NickleDave commented 2 years ago

Description

I now have my copy of the Google Sheet from @rhine3 set up so that I can start the import like so:

rse import --type google-sheet "https://docs.google.com/spreadsheets/d/e/2PACX-1vQkPsu14BG0bErrY0thXymfS55be0spEVX_WpWm2Yy3We8swMO0sIb3iD4Sg-i1lWnxSsiiN5JmWAD-/pub?gid=0&single=true&output=csv"

That is, I no longer get errors about missing fields (i.e. incorrect column names) now that I have removed the message in the hidden row 1 and renamed the first three columns.

The full sheet is here: https://docs.google.com/spreadsheets/d/1Ba1MY4o5Sm1f08IekJcbxAtSjkDN71Z1RZ42kzrofJ0/edit?usp=sharing

However, I now get an error that I'm not sure how to fix; traceback below.

Two notes:

What I Did

$ rse import --type google-sheet "https://docs.google.com/spreadsheets/d/e/2PACX-1vQkPsu14BG0bErrY0thXymfS55be0spEVX_WpWm2Yy3We8swMO0sIb3iD4Sg-i1lWnxSsiiN5JmWAD-/pub?gid=0&single=true&output=csv"
INFO:rse.main.import.google-sheet:Found software record: https://github.com/patriceguyot/Acoustic_Indices
INFO:rse.main.import.google-sheet:Found software record: https://www.adobe.com/products/audition.html
INFO:rse.main.import.google-sheet:Found software record: https://www.titley-scientific.com/us/anabat-insight.html
INFO:rse.main.import.google-sheet:Found software record: https://datadryad.org/stash/dataset/doi:10.5061/dryad.221mq23
INFO:rse.main.import.google-sheet:Found software record: https://github.com/ChristianBergler/ANIMAL-SPOT
INFO:rse.main.import.google-sheet:Found software record: https://arbimon.rfcx.org/
INFO:rse.main.import.google-sheet:Found software record: https://soundanalysis.wp.st-andrews.ac.uk/
INFO:rse.main.import.google-sheet:Found software record: https://www.audacityteam.org/download/
INFO:rse.main.import.google-sheet:Found software record: https://github.com/nwolek/audiomoth-scripts
INFO:rse.main.import.google-sheet:Found software record: https://github.com/sarabsethi/audioset_soundscape_feats_sethi2019/tree/master/calc_audioset_feats
INFO:rse.main.import.google-sheet:Found software record: https://autoencoded-vocal-analysis.readthedocs.io/en/latest/index.html
INFO:rse.main.import.google-sheet:Found software record: https://github.com/timsainb/AVGN
INFO:rse.main.import.google-sheet:Found software record: http://www.avianz.net/index.php
INFO:rse.main.import.google-sheet:Found software record: http://www.avisoft.com/sound-analysis/
INFO:rse.main.import.google-sheet:Found software record: https://github.com/EricArcher/banter
INFO:rse.main.import.google-sheet:Found software record: https://bitbucket.org/chrisscott/batclassify/src
INFO:rse.main.import.google-sheet:Found software record: https://github.com/macaodha/batdetect
INFO:rse.main.import.google-sheet:Found software record: https://www.batlogger.com/en/products/batexplorer/
INFO:rse.main.import.google-sheet:Found software record: https://www.wsl.ch/en/services-and-products/software-websites-and-apps/batscope-4.html
INFO:rse.main.import.google-sheet:Found software record: https://cran.r-project.org/web/packages/bioacoustics/index.html
INFO:rse.main.import.google-sheet:Found software record: https://birdnet.cornell.edu/
INFO:rse.main.import.google-sheet:Found software record: https://github.com/BirdVox/birdvoxclassify
INFO:rse.main.import.google-sheet:Found software record: https://github.com/BirdVox/birdvoxdetect
INFO:rse.main.import.google-sheet:Found software record: https://github.com/OpenWild/caracal
INFO:rse.main.import.google-sheet:Found software record: https://github.com/vocalpy/crowsetta
INFO:rse.main.import.google-sheet:Found software record: https://github.com/MarineBioAcousticsRC/DetEdit
INFO:rse.main.import.google-sheet:Found software record: https://github.com/DrCoffey/DeepSqueak
INFO:rse.main.import.google-sheet:Found software record: https://github.com/nilomr/fieldtools
INFO:rse.main.import.google-sheet:Found software record: https://github.com/DenaJGibbon/gibbonR-package
INFO:rse.main.import.google-sheet:Found software record: http://www.oldbird.org/glassofire.htm
INFO:rse.main.import.google-sheet:Found software record: https://www.goldwave.com/
INFO:rse.main.import.google-sheet:Found software record: https://github.com/Cdevenish/hardRain
INFO:rse.main.import.google-sheet:Found software record: https://sites.google.com/view/alcore-suzuki/home/harkbird
INFO:rse.main.import.google-sheet:Found software record: https://github.com/vocalpy/hybrid-vocal-classifier
INFO:rse.main.import.google-sheet:Found software record: https://github.com/DanWoodrich/INSTINCT
INFO:rse.main.import.google-sheet:Found software record: http://bioacoustics.us/ishmael.html
INFO:rse.main.import.google-sheet:Found software record: https://www.wildlifeacoustics.com/products/kaleidoscope-pro
INFO:rse.main.import.google-sheet:Found software record: https://meridian.cs.dal.ca/2015/04/12/ketos/
INFO:rse.main.import.google-sheet:Found software record: https://koe.io.ac.nz/
INFO:rse.main.import.google-sheet:Found software record: https://github.com/shyamblast/Koogu/tree/v0.6.5
INFO:rse.main.import.google-sheet:Found software record: https://librosa.org/librosa/
INFO:rse.main.import.google-sheet:Found software record: https://rflachlan.github.io/Luscinia/
INFO:rse.main.import.google-sheet:Found software record: https://cran.r-project.org/web/packages/monitoR/index.html
INFO:rse.main.import.google-sheet:Found software record: https://marce10.github.io/ohun/index.html
INFO:rse.main.import.google-sheet:Found software record: https://github.com/kitzeslab/opensoundscape
INFO:rse.main.import.google-sheet:Found software record: https://www.pamguard.org/
INFO:rse.main.import.google-sheet:Found software record: https://github.com/TaikiSan21/PAMr
INFO:rse.main.import.google-sheet:Found software record: https://github.com/YannickJadoul/Parselmouth
INFO:rse.main.import.google-sheet:Found software record: https://www.fon.hum.uva.nl/praat/
INFO:rse.main.import.google-sheet:Found software record: https://github.com/shivChitinous/prinia-project
INFO:rse.main.import.google-sheet:Found software record: https://ravensoundsoftware.com/software/raven-lite/
INFO:rse.main.import.google-sheet:Found software record: https://ravensoundsoftware.com/software/raven-pro
INFO:rse.main.import.google-sheet:Found software record: https://www.reaper.fm/
INFO:rse.main.import.google-sheet:Found software record: https://github.com/scikit-maad/scikit-maad
INFO:rse.main.import.google-sheet:Found software record: https://docs.scipy.org/doc/scipy/reference/signal.html
INFO:rse.main.import.google-sheet:Found software record: http://dx.doi.org/10.6084/m9.figshare.3792780
INFO:rse.main.import.google-sheet:Found software record: https://cran.r-project.org/web/packages/seewave/index.html
INFO:rse.main.import.google-sheet:Found software record: https://www.sonicvisualiser.org/
INFO:rse.main.import.google-sheet:Found software record: https://sonobat.com/
INFO:rse.main.import.google-sheet:Found software record: https://doi.org/10.1080/09524622.2013.827588
INFO:rse.main.import.google-sheet:Found software record: https://soundata.readthedocs.io/en/latest/
INFO:rse.main.import.google-sheet:Found software record: https://cran.r-project.org/web/packages/soundecology/vignettes/intro.html
INFO:rse.main.import.google-sheet:Found software record: https://github.com/macster110/aipam
INFO:rse.main.import.google-sheet:Found software record: https://github.com/rhine3/specky
INFO:rse.main.import.google-sheet:Found software record: https://github.com/YvesBas/Tadarida-L

https://github.com/YvesBas/Tadarida-D

https://github.com/YvesBas/Tadarida-C
INFO:rse.main.import.google-sheet:Found software record: https://www.cetus.ucsd.edu/technologies_triton.html
INFO:rse.main.import.google-sheet:Found software record: https://github.com/yardencsGitHub/tweetynet
INFO:rse.main.import.google-sheet:Found software record: https://github.com/vocalpy/vak
INFO:rse.main.import.google-sheet:Found software record: https://github.com/HaroldMills/Vesper
INFO:rse.main.import.google-sheet:Found software record: https://cran.r-project.org/web/packages/warbleR/index.html
Found 70 results
ERROR:rse.utils.urls:Cannot find endpoint https://api.github.com/repos/master/calc_audioset_feats.
Traceback (most recent call last):
  File "/home/pimienta/Documents/repos/coding/opensci/bioacoustics/bioacoustics-software/.venv/bin/rse", line 33, in <module>
    sys.exit(load_entry_point('rse', 'console_scripts', 'rse')())
  File "/home/pimienta/Documents/repos/coding/opensci/bioacoustics/rse/rse/client/__init__.py", line 520, in main
    main(args=args, extra=extra)
  File "/home/pimienta/Documents/repos/coding/opensci/bioacoustics/rse/rse/client/imp.py", line 28, in main
    importer.create(
  File "/home/pimienta/Documents/repos/coding/opensci/bioacoustics/rse/rse/main/scrapers/googlesheet.py", line 99, in create
    result = update_nonempty(result, data)
  File "/home/pimienta/Documents/repos/coding/opensci/bioacoustics/rse/rse/utils/strings.py", line 13, in update_nonempty
    for key, value in source.items():
AttributeError: 'NoneType' object has no attribute 'items'

I can't see anything specifically different about the link that causes the crash:

https://github.com/sarabsethi/audioset_soundscape_feats_sethi2019/tree/master/calc_audioset_feats

I notice that when I click on the link I get a redirect from GitHub?
But this happens for all the links in the rows above it too.
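For readers following the traceback: the crash itself is an `AttributeError` raised when `update_nonempty` calls `.items()` on a value that is `None`, which suggests the metadata lookup returned nothing after the "Cannot find endpoint" error. The sketch below is a hypothetical guarded variant, not the code in `rse/utils/strings.py`; the name `update_nonempty_safe`, the `dest` parameter, and the definition of "non-empty" are all assumptions made for illustration.

```python
from typing import Optional


def update_nonempty_safe(dest: dict, source: Optional[dict]) -> dict:
    """Copy non-empty values from source into dest.

    Hypothetical variant: when source is None (for example, because a
    metadata request failed), return dest unchanged instead of raising
    AttributeError on source.items().
    """
    if source is None:
        return dest
    for key, value in source.items():
        # Assumption: "non-empty" means not None and not an empty container/string.
        if value not in (None, "", [], {}):
            dest[key] = value
    return dest


# A failed lookup (None) leaves the existing record untouched.
record = {"url": "https://github.com/vocalpy/vak"}
print(update_nonempty_safe(record, None))
```

A guard like this would only hide the symptom; the row would still be imported without any GitHub metadata.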

vsoch commented 2 years ago

Row 11 of your spreadsheet is not a valid GitHub identifier:

https://github.com/sarabsethi/audioset_soundscape_feats_sethi2019/tree/master/calc_audioset_feats

That's why it says "cannot find endpoint" and shows that partial URL.

NickleDave commented 2 years ago

:+1: confirmed that if I change row 11 to just https://github.com/sarabsethi/audioset_soundscape_feats_sethi2019 (without the /tree/master/calc_audioset_feats) then this row no longer throws an error

NickleDave commented 2 years ago

So a "valid GitHub identifier" is just the landing page of a repo, i.e. the URL without anything past the project name?
Just making sure I understand.

NickleDave commented 2 years ago

I got further this time but got a 403, rate limit exceeded

:sob:

Is that maybe cuz I've been running it over and over again?
Or could this happen any time it runs on a sheet with greater than x rows?

INFO:rse.main.database.filesystem:github/BirdVox/birdvoxdetect was added to the the database.
INFO:rse.main.database.filesystem:github/OpenWild/caracal was added to the the database.
INFO:rse.main.database.filesystem:github/vocalpy/crowsetta was added to the the database.
INFO:rse.main.database.filesystem:github/MarineBioAcousticsRC/DetEdit was added to the the database.
INFO:rse.main.database.filesystem:github/DrCoffey/DeepSqueak was added to the the database.
INFO:rse.main.database.filesystem:github/nilomr/fieldtools was added to the the database.
INFO:rse.main.database.filesystem:github/DenaJGibbon/gibbonR-package was added to the the database.
INFO:rse.main.database.filesystem:custom/www-oldbird-org/glassofire-htm was added to the the database.
INFO:rse.main.database.filesystem:custom/www-goldwave-com was added to the the database.
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/Cdevenish/hardRain/topics: 403, rate limit exceeded
INFO:rse.main.database.filesystem:github/Cdevenish/hardRain was added to the the database.
INFO:rse.main.database.filesystem:custom/sites-google-com/view/alcore-suzuki/home/harkbird was added to the the database.
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/vocalpy/hybrid-vocal-classifier: 403, rate limit exceeded
Traceback (most recent call last):
  File "/home/pimienta/Documents/repos/coding/opensci/bioacoustics/bioacoustics-software/.venv/bin/rse", line 33, in <module>
    sys.exit(load_entry_point('rse', 'console_scripts', 'rse')())
  File "/home/pimienta/Documents/repos/coding/opensci/bioacoustics/rse/rse/client/__init__.py", line 520, in main
    main(args=args, extra=extra)
  File "/home/pimienta/Documents/repos/coding/opensci/bioacoustics/rse/rse/client/imp.py", line 28, in main
    importer.create(
  File "/home/pimienta/Documents/repos/coding/opensci/bioacoustics/rse/rse/main/scrapers/googlesheet.py", line 99, in create
    result = update_nonempty(result, data)
  File "/home/pimienta/Documents/repos/coding/opensci/bioacoustics/rse/rse/utils/strings.py", line 13, in update_nonempty
    for key, value in source.items():
AttributeError: 'NoneType' object has no attribute 'items'

vsoch commented 2 years ago

Ah yes! This is common whenever you use the GitHub API in any context. If you export GITHUB_TOKEN=xxx (a personal access token), it should increase your limit. It only happened because you ran the same thing many times (and there are a lot of rows!)
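As a rough illustration of the token advice, here is a minimal sketch of sending a personal access token with a GitHub API request and checking the remaining quota. It is independent of how rse reads `GITHUB_TOKEN` internally; the helper names `github_headers` and `remaining_requests` are made up for this example. Unauthenticated requests are limited to 60 per hour, while authenticated requests get 5,000 per hour.

```python
import json
import os
import urllib.request


def github_headers(token=None):
    """Build request headers; add Authorization only if a token is available."""
    # Fall back to the GITHUB_TOKEN environment variable when no token is passed.
    token = token or os.environ.get("GITHUB_TOKEN")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def remaining_requests():
    """Ask GitHub how many core API requests remain (makes a network call)."""
    req = urllib.request.Request(
        "https://api.github.com/rate_limit", headers=github_headers()
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["resources"]["core"]["remaining"]
```

Calling `remaining_requests()` before a large import gives a quick check on whether the quota will cover one or more API calls per GitHub row.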

A valid GitHub identifier matches:

https://github.com/<user-or-org>/<repo>

Right now it does very basic matching and uses some split logic to derive the underlying name. Looking at this now, I could improve upon it. I'll open an issue for future me to work on :) The reason it's done simply is that (at least before this googlesheet addition) most GitHub URLs came from places where they would always be in that format. Now that's not the case, so my parser likely needs to account for that.
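To make the failure concrete, here is a hedged sketch of the two approaches. `naive_repo_id` is only a guess at what "split logic" might look like (keeping the last two path segments); it is not taken from rse, but it does reproduce the `master/calc_audioset_feats` fragment seen in the error message. `tolerant_repo_id` is one possible alternative that keeps the first two path segments instead. Both function names are hypothetical.

```python
from urllib.parse import urlparse


def naive_repo_id(url):
    """Assumed naive approach: keep the last two path segments."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    return "/".join(parts[-2:])


def tolerant_repo_id(url):
    """Keep the first two path segments (<user-or-org>/<repo>), ignoring the rest."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        return None  # not a GitHub URL
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        return None  # e.g. a bare user or organization page
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"{owner}/{repo}"


url = (
    "https://github.com/sarabsethi/audioset_soundscape_feats_sethi2019"
    "/tree/master/calc_audioset_feats"
)
print(naive_repo_id(url))     # master/calc_audioset_feats
print(tolerant_repo_id(url))  # sarabsethi/audioset_soundscape_feats_sethi2019
```

Taking the leading segments means links to subdirectories, branches, or files all resolve to the same repository identifier.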

NickleDave commented 2 years ago

> Ah yes! This is common whenever you use the GitHub API in any context. If you export GITHUB_TOKEN=xxx (a personal access token), it should increase your limit. It only happened because you ran the same thing many times (and there are a lot of rows!)

Got it, thank you

I think I can close this then

> ... GitHub urls ...

So just to make sure I understand the source of the error:
This URL doesn't become a "custom" entry in the database because some logic recognizes the "github" in the URL. It then tries to parse the URL as a GitHub identifier and fails, because it's not a valid ID and we currently require strict IDs.

vsoch commented 2 years ago

Correct! I opened issue #77 so that, once it's addressed, you should be able to provide a poorly formatted GitHub URL and still retrieve the metadata. Have your cake and eat it too :) :cake:

NickleDave commented 2 years ago

Excellent, will close in anticipation of cake