rseng / rse

tools for assessment and categorization of research software
https://rseng.github.io/rse/
Mozilla Public License 2.0

Not clear why malformed entries are malformed #82

Closed NickleDave closed 1 year ago

NickleDave commented 1 year ago

Description

Hi @vsoch thank you again for adding the ability to import .csv files directly.
I'm working on a script to clean up a .csv, but I can't figure out why the scraper reports some entries as malformed instead of turning them into a "custom" repo. I think it has something to do with the format of the URL?

Below I'll put output from running. Seems like anything that's not "https://" will fail.
Is that so? Is it in the docs somewhere that I need to have all urls be this format and I'm missing it? I don't find anything about the custom parser.
Sorry if I'm misunderstanding what the main loop in CSVImporter.create is doing.

What I Did

Here's the report of malformed entries.
Seems like anything that's not "https://" fails. E.g., "www.", "readthedocs.", etc.

Yes I haven't set up a GitHub token yet

... all the INFO logs here of "Found software record" ...
Found 70 results
WARNING:rse.main.import.csv:Skipping malformed entry www.adobe.com/products/audition.html
WARNING:rse.main.import.csv:Skipping malformed entry www.titley-scientific.com/us/anabat-insight.html
WARNING:rse.main.import.csv:Skipping malformed entry datadryad.org/stash/dataset/doi:10.5061/dryad.221mq23
WARNING:rse.main.import.csv:Skipping malformed entry arbimon.rfcx.org/
WARNING:rse.main.import.csv:Skipping malformed entry soundanalysis.wp.st-andrews.ac.uk/
WARNING:rse.main.import.csv:Skipping malformed entry www.audacityteam.org/download/
WARNING:rse.main.import.csv:Skipping malformed entry autoencoded-vocal-analysis.readthedocs.io/en/latest/index.html
WARNING:rse.main.import.csv:Skipping malformed entry www.avianz.net/index.php
WARNING:rse.main.import.csv:Skipping malformed entry www.avisoft.com/sound-analysis/
WARNING:rse.main.import.csv:Skipping malformed entry bitbucket.org/chrisscott/batclassify/src
WARNING:rse.main.import.csv:Skipping malformed entry www.batlogger.com/en/products/batexplorer/
WARNING:rse.main.import.csv:Skipping malformed entry www.wsl.ch/en/services-and-products/software-websites-and-apps/batscope-4.html
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/bioacoustics/index.html
WARNING:rse.main.import.csv:Skipping malformed entry birdnet.cornell.edu/
WARNING:rse.main.import.csv:Skipping malformed entry www.oldbird.org/glassofire.htm
WARNING:rse.main.import.csv:Skipping malformed entry www.goldwave.com/
WARNING:rse.main.import.csv:Skipping malformed entry sites.google.com/view/alcore-suzuki/home/harkbird
WARNING:rse.main.import.csv:Skipping malformed entry bioacoustics.us/ishmael.html
WARNING:rse.main.import.csv:Skipping malformed entry www.wildlifeacoustics.com/products/kaleidoscope-pro
WARNING:rse.main.import.csv:Skipping malformed entry meridian.cs.dal.ca/2015/04/12/ketos/
WARNING:rse.main.import.csv:Skipping malformed entry koe.io.ac.nz/
ERROR:rse.utils.urls:Cannot find endpoint https://api.github.com/repos/tree/v0.6.5.
WARNING:rse.main.import.csv:Skipping malformed entry github.com/shyamblast/Koogu/tree/v0.6.5
WARNING:rse.main.import.csv:Skipping malformed entry librosa.org/librosa/
ERROR:rse.utils.urls:Cannot find endpoint https://api.github.com/repos/rflachlanhub.io/Luscinia.
WARNING:rse.main.import.csv:Skipping malformed entry rflachlan.github.io/Luscinia/
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/monitoR/index.html
ERROR:rse.utils.urls:Cannot find endpoint https://api.github.com/repos/ohun/index.html.
WARNING:rse.main.import.csv:Skipping malformed entry marce10.github.io/ohun/index.html
WARNING:rse.main.import.csv:Skipping malformed entry www.pamguard.org/
WARNING:rse.main.import.csv:Skipping malformed entry www.fon.hum.uva.nl/praat/
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/shivChitinous/prinia-project: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/shivChitinous/prinia-project
WARNING:rse.main.import.csv:Skipping malformed entry ravensoundsoftware.com/software/raven-lite/
WARNING:rse.main.import.csv:Skipping malformed entry ravensoundsoftware.com/software/raven-pro
WARNING:rse.main.import.csv:Skipping malformed entry www.reaper.fm/
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/scikit-maad/scikit-maad: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/scikit-maad/scikit-maad
WARNING:rse.main.import.csv:Skipping malformed entry docs.scipy.org/doc/scipy/reference/signal.html
WARNING:rse.main.import.csv:Skipping malformed entry dx.doi.org/10.6084/m9.figshare.3792780
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/seewave/index.html
WARNING:rse.main.import.csv:Skipping malformed entry www.sonicvisualiser.org/
WARNING:rse.main.import.csv:Skipping malformed entry sonobat.com/
WARNING:rse.main.import.csv:Skipping malformed entry doi.org/10.1080/09524622.2013.827588
WARNING:rse.main.import.csv:Skipping malformed entry soundata.readthedocs.io/en/latest/
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/soundecology/vignettes/intro.html
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/macster110/aipam: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/macster110/aipam
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/rhine3/specky: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/rhine3/specky
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/YvesBas/Tadarida-C: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/YvesBas/Tadarida-C
WARNING:rse.main.import.csv:Skipping malformed entry www.cetus.ucsd.edu/technologies_triton.html
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/yardencsGitHub/tweetynet: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/yardencsGitHub/tweetynet
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/vocalpy/vak: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/vocalpy/vak
ERROR:rse.utils.urls:Permission denied to query https://api.github.com/repos/HaroldMills/Vesper: 403, rate limit exceeded
WARNING:rse.main.import.csv:Skipping malformed entry github.com/HaroldMills/Vesper
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/warbleR/index.html

vsoch commented 1 year ago

Most of those are malformed because they are not GitHub URLs in the format https://github.com/<user>/<repo>. A handful look OK, but we can't know for sure whether they would work because the GitHub API returned "403, rate limit exceeded".
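To tell the two cases apart in a log like the one above, one option is to pair each "Skipping malformed entry" warning with the ERROR line directly before it, if there is one. The sketch below assumes that an ERROR line immediately preceding a warning refers to the same entry, which is an inference from the log ordering rather than documented behavior. `group_skipped` and the reason labels are hypothetical names, not part of rse.

```python
import re
from collections import defaultdict

# Patterns copied from the logger prefixes that appear in the log above.
ERROR_RE = re.compile(r"^ERROR:rse\.utils\.urls:(?P<msg>.*)$")
SKIP_RE = re.compile(
    r"^WARNING:rse\.main\.import\.csv:Skipping malformed entry (?P<uri>\S+)$"
)


def group_skipped(log_text):
    """Group skipped entries by the error (if any) logged just before them.

    Assumption: an ERROR line directly above a WARNING line is about the
    same entry. This is inferred from the log, not confirmed by rse.
    """
    groups = defaultdict(list)
    last_error = None
    for line in log_text.splitlines():
        line = line.strip()
        error = ERROR_RE.match(line)
        if error:
            last_error = error.group("msg")
            continue
        skip = SKIP_RE.match(line)
        if skip:
            if last_error is None:
                reason = "no preceding error"
            elif "rate limit exceeded" in last_error:
                reason = "rate limited"
            elif "Cannot find endpoint" in last_error:
                reason = "endpoint not found"
            else:
                reason = "other error"
            groups[reason].append(skip.group("uri"))
        # Any non-ERROR line ends the pairing window.
        last_error = None
    return dict(groups)
```

Entries in the "rate limited" group are the ones worth retrying once a GitHub token is configured; the "no preceding error" group is where the non-GitHub URLs end up.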

vsoch commented 1 year ago

And that's correct, it's expecting a fully formed GitHub URL, not a link to an index.html, Read the Docs, or anywhere else.
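As a rough pre-check before importing, a CSV cleanup script could flag URLs that don't have the https://github.com/<user>/<repo> shape described above. This is a minimal sketch using only the standard library; `is_github_repo_url` is a hypothetical helper, not part of rse, and rse's own matching may well differ.

```python
from urllib.parse import urlparse


def is_github_repo_url(url: str) -> bool:
    """Return True only for URLs shaped like https://github.com/<user>/<repo>.

    Hypothetical helper for illustration; this is not rse's parser logic.
    """
    parsed = urlparse(url.strip())
    # Without a scheme, urlparse puts the whole string in .path and leaves
    # .netloc empty, so scheme-less entries fail this check.
    if parsed.scheme != "https" or parsed.netloc.lower() != "github.com":
        return False
    # Exactly two non-empty path segments: <user> and <repo>.
    segments = [segment for segment in parsed.path.split("/") if segment]
    return len(segments) == 2


if __name__ == "__main__":
    for candidate in (
        "https://github.com/vocalpy/vak",
        "https://github.com/shyamblast/Koogu/tree/v0.6.5",
        "https://marce10.github.io/ohun/index.html",
    ):
        print(candidate, is_github_repo_url(candidate))
```

Entries that fail this check are not necessarily invalid; they are simply not direct links to a GitHub repository, for example a link to a tag (`/tree/v0.6.5`) or to a GitHub Pages site.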

NickleDave commented 1 year ago

None of the first ~30 malformed entries are GitHub URLs.
When I inspect with a breakpoint, I see that the regex used to match parser <-> uri returns a CustomParser, at least for the first one (www.adobe.com).

Are they supposed to parse as custom automagically or am I misunderstanding how rse works?

How come sometimes the cran ones work and other times they don't?

Not trying to be a grumpy pain, I'm just trying to understand the intended functionality.

WARNING:rse.main.import.csv:Skipping malformed entry www.adobe.com/products/audition.html
WARNING:rse.main.import.csv:Skipping malformed entry www.titley-scientific.com/us/anabat-insight.html
WARNING:rse.main.import.csv:Skipping malformed entry datadryad.org/stash/dataset/doi:10.5061/dryad.221mq23
WARNING:rse.main.import.csv:Skipping malformed entry arbimon.rfcx.org/
WARNING:rse.main.import.csv:Skipping malformed entry soundanalysis.wp.st-andrews.ac.uk/
WARNING:rse.main.import.csv:Skipping malformed entry www.audacityteam.org/download/
WARNING:rse.main.import.csv:Skipping malformed entry autoencoded-vocal-analysis.readthedocs.io/en/latest/index.html
WARNING:rse.main.import.csv:Skipping malformed entry www.avianz.net/index.php
WARNING:rse.main.import.csv:Skipping malformed entry www.avisoft.com/sound-analysis/
WARNING:rse.main.import.csv:Skipping malformed entry bitbucket.org/chrisscott/batclassify/src
WARNING:rse.main.import.csv:Skipping malformed entry www.batlogger.com/en/products/batexplorer/
WARNING:rse.main.import.csv:Skipping malformed entry www.wsl.ch/en/services-and-products/software-websites-and-apps/batscope-4.html
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/bioacoustics/index.html
WARNING:rse.main.import.csv:Skipping malformed entry birdnet.cornell.edu/
WARNING:rse.main.import.csv:Skipping malformed entry www.oldbird.org/glassofire.htm
WARNING:rse.main.import.csv:Skipping malformed entry www.goldwave.com/
WARNING:rse.main.import.csv:Skipping malformed entry sites.google.com/view/alcore-suzuki/home/harkbird
WARNING:rse.main.import.csv:Skipping malformed entry bioacoustics.us/ishmael.html
WARNING:rse.main.import.csv:Skipping malformed entry www.wildlifeacoustics.com/products/kaleidoscope-pro
WARNING:rse.main.import.csv:Skipping malformed entry meridian.cs.dal.ca/2015/04/12/ketos/
WARNING:rse.main.import.csv:Skipping malformed entry koe.io.ac.nz/

vsoch commented 1 year ago

Can you give me your sheets URL and the command here so I can reproduce?

NickleDave commented 1 year ago

Sorry, was working on it. I should've thought to include it in the first place.

.csv attached. I'm doing

rse import --type csv copy-Bioacoustics-software.csv

copy-Bioacoustics-software.csv

vsoch commented 1 year ago

I think I'm seeing the ones I'd expect to be imported, imported

database/
└── github
    ├── BirdVox
    │   ├── birdvoxclassify
    │   │   └── metadata.json
    │   └── birdvoxdetect
    │       └── metadata.json
    ├── Cdevenish
    │   └── hardRain
    │       └── metadata.json
    ├── ChristianBergler
    │   └── ANIMAL-SPOT
    │       └── metadata.json
    ├── DanWoodrich
    │   └── INSTINCT
    │       └── metadata.json
    ├── DenaJGibbon
    │   └── gibbonR-package
    │       └── metadata.json
    ├── DrCoffey
    │   └── DeepSqueak
    │       └── metadata.json
    ├── EricArcher
    │   └── banter
    │       └── metadata.json
    ├── kitzeslab
    │   └── opensoundscape
    │       └── metadata.json
    ├── macaodha
    │   └── batdetect
    │       └── metadata.json
    ├── macster110
    │   └── aipam
    │       └── metadata.json
    ├── MarineBioAcousticsRC
    │   └── DetEdit
    │       └── metadata.json
    ├── nilomr
    │   └── fieldtools
    │       └── metadata.json
    ├── nwolek
    │   └── audiomoth-scripts
    │       └── metadata.json
    ├── OpenWild
    │   └── caracal
    │       └── metadata.json
    ├── patriceguyot
    │   └── Acoustic_Indices
    │       └── metadata.json
    ├── rhine3
    │   └── specky
    │       └── metadata.json
    ├── sarabsethi
    │   └── audioset_soundscape_feats_sethi2019
    │       └── metadata.json
    ├── scikit-maad
    │   └── scikit-maad
    │       └── metadata.json
    ├── shivChitinous
    │   └── prinia-project
    │       └── metadata.json
    ├── TaikiSan21
    │   └── PAMr
    │       └── metadata.json
    ├── timsainb
    │   └── AVGN
    │       └── metadata.json
    ├── vocalpy
    │   ├── crowsetta
    │   │   └── metadata.json
    │   ├── hybrid-vocal-classifier
    │   │   └── metadata.json
    │   └── vak
    │       └── metadata.json
    ├── YannickJadoul
    │   └── Parselmouth
    │       └── metadata.json
    ├── yardencsGitHub
    │   └── tweetynet
    │       └── metadata.json
    └── YvesBas
        └── Tadarida-C
            └── metadata.json

54 directories, 28 files

I don't have spreadsheet software handy so I can't open the csv to see if you have all the required fields for each, but I'd check that.

NickleDave commented 1 year ago

Thank you for taking the time to check.

I am confused about why I can no longer generate records for entries that worked previously.

For example, see this custom record for arbimon that was generated with the initial run: https://github.com/NickleDave/bioacoustics-software/blob/add-rse/database/custom/arbimon-rfcx-org/metadata.json
From the timestamp I can see it was generated on 2022-08-18:
https://github.com/NickleDave/bioacoustics-software/blob/bdebcc7669c05f7a92e275af34c3ce6acf6a75c5/database/custom/arbimon-rfcx-org/metadata.json#L26

        "timestamp": "2022-08-18 14:26:06.351778"

This is when I first ran import I think.

But now this entry is considered malformed.

In fact, if I'm in my feature branch and I do rm -rf database and then rerun rse import --type csv copy-Bioacoustics-software.csv, then none of the custom entries get re-generated. I only have database/github.

$ git st
On branch add-rse
Your branch is up to date with 'origin/add-rse'.

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        deleted:    database/custom/arbimon-rfcx-org/metadata.json
        deleted:    database/custom/autoencoded-vocal-analysis-readthedocs-io/en/latest/index-html/metadata.json
        deleted:    database/custom/bioacoustics-us/ishmael-html/metadata.json
        deleted:    database/custom/birdnet-cornell-edu/metadata.json
        ...
        deleted:    database/custom/www-wildlifeacoustics-com/products/kaleidoscope-pro/metadata.json
        deleted:    database/custom/www-wsl-ch/en/services-and-products/software-websites-and-apps/batscope-4-html/metadata.json
        modified:   database/github/BirdVox/birdvoxclassify/metadata.json
        modified:   database/github/BirdVox/birdvoxdetect/metadata.json
        ...

If I roll back to rse==0.0.44 and I import directly from --google-sheet then I am able to generate those custom records again.

Is there something specific about the csv importer that makes those not work?

NickleDave commented 1 year ago

This is what I ran with rse==0.0.44 to generate a database that does include the custom records, if it helps:

rse import --type google-sheet "https://docs.google.com/spreadsheets/d/e/2PACX-1vQkPsu14BG0bErrY0thXymfS55be0spEVX_WpWm2Yy3We8swMO0sIb3iD4Sg-i1lWnxSsiiN5JmWAD-/pub?gid=0&single=true&output=csv"

vsoch commented 1 year ago

Perhaps the place to start is to compare the google sheet to your csv? They use the same underlying logic. If you want to walk through in detail check out IPython. As a sanity check I’d also try Google Sheet without “rolling back” as it still exists in the new release.

NickleDave commented 1 year ago

Ah, sorry I wasn't clear.
It's literally the same sheet, I haven't done anything to it, I just downloaded it.

I will test using the --google-sheet import with the current version

NickleDave commented 1 year ago

I tested with the current version and I no longer get the "custom" entries when I do
rse import --type google-sheet "https://docs.google.com/spreadsheets/d/e/2PACX-1vQkPsu14BG0bErrY0thXymfS55be0spEVX_WpWm2Yy3We8swMO0sIb3iD4Sg-i1lWnxSsiiN5JmWAD-/pub?gid=0&single=true&output=csv"

It seems like all the custom entries are considered malformed.

This is true for rse==0.0.45 also: I don't get the custom entries.

I do for rse==0.0.44 though.

NickleDave commented 1 year ago

I think it's because the continue statement in the block that issues the warning about malformed entries prevents the inner loop from reaching the logic that creates custom entries?

i.e., if this line https://github.com/rseng/rse/blob/59820541a165de66e1a4fb5c7c32b64660979051/rse/main/scrapers/csv.py#L117 runs because repo is a CustomParser instance that returns an empty dict from get_metadata, then we never get to https://github.com/rseng/rse/blob/59820541a165de66e1a4fb5c7c32b64660979051/rse/main/scrapers/csv.py#L121
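The hypothesis can be illustrated with a toy loop. This is a sketch of the control-flow pattern only, not rse's actual code: `import_rows`, `get_metadata`, `skip_on_empty`, and the fallback record are all hypothetical stand-ins.

```python
def import_rows(uris, get_metadata, skip_on_empty=True):
    """Toy import loop showing how an early `continue` hides a fallback.

    Hypothetical illustration; names and structure are assumptions and do
    not mirror rse/main/scrapers/csv.py.
    """
    created, skipped = [], []
    for uri in uris:
        metadata = get_metadata(uri)
        if not metadata and skip_on_empty:
            # Analogous to logging "Skipping malformed entry": the
            # `continue` jumps to the next uri, so the fallback below
            # is never reached for entries with empty metadata.
            skipped.append(uri)
            continue
        if not metadata:
            # Fallback that would create a "custom" record.
            metadata = {"uri": uri, "kind": "custom"}
        created.append(metadata)
    return created, skipped


if __name__ == "__main__":
    uris = ["arbimon.rfcx.org/", "koe.io.ac.nz/"]
    empty = lambda uri: {}  # stands in for a parser that returns no metadata
    print(import_rows(uris, empty, skip_on_empty=True))
    print(import_rows(uris, empty, skip_on_empty=False))
```

With the early `continue` active, every entry whose metadata is empty lands in `skipped` and no custom record is created; without it, the same entries fall through to the fallback.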

vsoch commented 1 year ago

Nice you have a hypothesis! It looks like a bug I think - does it work when you comment it out?

vsoch commented 1 year ago

Please see #84; it should fix the issues here! It was a bit more than the continue: a change in design that I didn't properly implement.