Closed NickleDave closed 1 year ago
Most of those are malformed because they are not GitHub URLs in the format https://github.com/&lt;user&gt;/&lt;repo&gt;. A handful look OK, but we can't know for sure whether they would work because of the "403, rate limit exceeded" error.
And that's correct: it's expecting a fully formed GitHub URL, not a link to an index.html, Read the Docs, or anywhere else.
None of the first ~30 malformed entries are GitHub URLs.
When I inspect with a breakpoint, I see that the regex used to match parser &lt;-&gt; URI returns a CustomParser, at least for the first one (www.adobe.com).
Are they supposed to parse as custom automagically, or am I misunderstanding how rse works?
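To make sure I'm describing the matching step clearly, here is a minimal sketch of the kind of regex-based parser dispatch I mean. This is hypothetical code with assumed names (GITHUB_PATTERN, match_parser), not rse's actual implementation:

```python
import re

# Hypothetical sketch: classify a URI as "github" when it is a
# well-formed https://github.com/<user>/<repo> URL, otherwise
# fall back to "custom". Names here are assumptions, not rse's API.
GITHUB_PATTERN = re.compile(
    r"^https://github\.com/(?P<user>[^/]+)/(?P<repo>[^/]+)/?$"
)

def match_parser(uri):
    """Return the parser name that would handle this URI."""
    if GITHUB_PATTERN.match(uri):
        return "github"
    return "custom"

print(match_parser("https://github.com/vocalpy/vak"))        # github
print(match_parser("www.adobe.com/products/audition.html"))  # custom
```

Under this model, every entry in my spreadsheet that isn't a full GitHub repo URL would be routed to the custom parser rather than rejected outright.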
How come sometimes the cran ones work and other times they don't?
Not trying to be a grumpy pain, I'm just trying to understand the intended functionality.
WARNING:rse.main.import.csv:Skipping malformed entry www.adobe.com/products/audition.html
WARNING:rse.main.import.csv:Skipping malformed entry www.titley-scientific.com/us/anabat-insight.html
WARNING:rse.main.import.csv:Skipping malformed entry datadryad.org/stash/dataset/doi:10.5061/dryad.221mq23
WARNING:rse.main.import.csv:Skipping malformed entry arbimon.rfcx.org/
WARNING:rse.main.import.csv:Skipping malformed entry soundanalysis.wp.st-andrews.ac.uk/
WARNING:rse.main.import.csv:Skipping malformed entry www.audacityteam.org/download/
WARNING:rse.main.import.csv:Skipping malformed entry autoencoded-vocal-analysis.readthedocs.io/en/latest/index.html
WARNING:rse.main.import.csv:Skipping malformed entry www.avianz.net/index.php
WARNING:rse.main.import.csv:Skipping malformed entry www.avisoft.com/sound-analysis/
WARNING:rse.main.import.csv:Skipping malformed entry bitbucket.org/chrisscott/batclassify/src
WARNING:rse.main.import.csv:Skipping malformed entry www.batlogger.com/en/products/batexplorer/
WARNING:rse.main.import.csv:Skipping malformed entry www.wsl.ch/en/services-and-products/software-websites-and-apps/batscope-4.html
WARNING:rse.main.import.csv:Skipping malformed entry cran.r-project.org/web/packages/bioacoustics/index.html
WARNING:rse.main.import.csv:Skipping malformed entry birdnet.cornell.edu/
WARNING:rse.main.import.csv:Skipping malformed entry www.oldbird.org/glassofire.htm
WARNING:rse.main.import.csv:Skipping malformed entry www.goldwave.com/
WARNING:rse.main.import.csv:Skipping malformed entry sites.google.com/view/alcore-suzuki/home/harkbird
WARNING:rse.main.import.csv:Skipping malformed entry bioacoustics.us/ishmael.html
WARNING:rse.main.import.csv:Skipping malformed entry www.wildlifeacoustics.com/products/kaleidoscope-pro
WARNING:rse.main.import.csv:Skipping malformed entry meridian.cs.dal.ca/2015/04/12/ketos/
WARNING:rse.main.import.csv:Skipping malformed entry koe.io.ac.nz/
Can you give me your sheets URL and the command here so I can reproduce?
Sorry, was working on it. I should've thought to include it in the first place.
.csv attached. I'm doing:
rse import --type csv copy-Bioacoustics-software.csv
I think I'm seeing the ones I'd expect to be imported actually get imported:
database/
└── github
├── BirdVox
│ ├── birdvoxclassify
│ │ └── metadata.json
│ └── birdvoxdetect
│ └── metadata.json
├── Cdevenish
│ └── hardRain
│ └── metadata.json
├── ChristianBergler
│ └── ANIMAL-SPOT
│ └── metadata.json
├── DanWoodrich
│ └── INSTINCT
│ └── metadata.json
├── DenaJGibbon
│ └── gibbonR-package
│ └── metadata.json
├── DrCoffey
│ └── DeepSqueak
│ └── metadata.json
├── EricArcher
│ └── banter
│ └── metadata.json
├── kitzeslab
│ └── opensoundscape
│ └── metadata.json
├── macaodha
│ └── batdetect
│ └── metadata.json
├── macster110
│ └── aipam
│ └── metadata.json
├── MarineBioAcousticsRC
│ └── DetEdit
│ └── metadata.json
├── nilomr
│ └── fieldtools
│ └── metadata.json
├── nwolek
│ └── audiomoth-scripts
│ └── metadata.json
├── OpenWild
│ └── caracal
│ └── metadata.json
├── patriceguyot
│ └── Acoustic_Indices
│ └── metadata.json
├── rhine3
│ └── specky
│ └── metadata.json
├── sarabsethi
│ └── audioset_soundscape_feats_sethi2019
│ └── metadata.json
├── scikit-maad
│ └── scikit-maad
│ └── metadata.json
├── shivChitinous
│ └── prinia-project
│ └── metadata.json
├── TaikiSan21
│ └── PAMr
│ └── metadata.json
├── timsainb
│ └── AVGN
│ └── metadata.json
├── vocalpy
│ ├── crowsetta
│ │ └── metadata.json
│ ├── hybrid-vocal-classifier
│ │ └── metadata.json
│ └── vak
│ └── metadata.json
├── YannickJadoul
│ └── Parselmouth
│ └── metadata.json
├── yardencsGitHub
│ └── tweetynet
│ └── metadata.json
└── YvesBas
└── Tadarida-C
└── metadata.json
54 directories, 28 files
I don't have spreadsheet software handy so I can't open the csv to see if you have all the required fields for each, but I'd check that.
Thank you for taking the time to check.
I am confused about why I can no longer generate records for entries that worked previously.
For example, see this custom record for arbimon that was generated with the initial run:
https://github.com/NickleDave/bioacoustics-software/blob/add-rse/database/custom/arbimon-rfcx-org/metadata.json
From the time stamp I can see it was generated on 2022-08-18:
https://github.com/NickleDave/bioacoustics-software/blob/bdebcc7669c05f7a92e275af34c3ce6acf6a75c5/database/custom/arbimon-rfcx-org/metadata.json#L26
"timestamp": "2022-08-18 14:26:06.351778"
This is when I first ran import I think.
But now this entry is considered malformed.
In fact, if I'm in my feature branch and I do rm -rf database and then rerun rse import --type csv copy-Bioacoustics-software.csv, then none of the custom entries get re-generated. I only have database/github.
$ git st
On branch add-rse
Your branch is up to date with 'origin/add-rse'.
Changes not staged for commit:
(use "git add/rm <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
deleted: database/custom/arbimon-rfcx-org/metadata.json
deleted: database/custom/autoencoded-vocal-analysis-readthedocs-io/en/latest/index-html/metadata.json
deleted: database/custom/bioacoustics-us/ishmael-html/metadata.json
deleted: database/custom/birdnet-cornell-edu/metadata.json
...
deleted: database/custom/www-wildlifeacoustics-com/products/kaleidoscope-pro/metadata.json
deleted: database/custom/www-wsl-ch/en/services-and-products/software-websites-and-apps/batscope-4-html/metadata.json
modified: database/github/BirdVox/birdvoxclassify/metadata.json
modified: database/github/BirdVox/birdvoxdetect/metadata.json
...
If I roll back to rse==0.0.44 and I import directly from --google-sheet, then I am able to generate those custom records again. Is there something specific about the csv importer that makes those not work?
This is what I ran with rse==0.0.44 to generate a database that does include the custom records, if it helps:
rse import --type google-sheet "https://docs.google.com/spreadsheets/d/e/2PACX-1vQkPsu14BG0bErrY0thXymfS55be0spEVX_WpWm2Yy3We8swMO0sIb3iD4Sg-i1lWnxSsiiN5JmWAD-/pub?gid=0&single=true&output=csv"
Perhaps the place to start is to compare the Google Sheet to your csv? They use the same underlying logic. If you want to walk through it in detail, check it out in IPython. As a sanity check I'd also try the Google Sheet import without "rolling back," since it still exists in the new release.
Ah, sorry I wasn't clear.
It's literally the same sheet, I haven't done anything to it, I just downloaded it.
I will test using the --google-sheet import with the current version
I tested with the current version and I no longer get the "custom" entries when I do
rse import --type google-sheet "https://docs.google.com/spreadsheets/d/e/2PACX-1vQkPsu14BG0bErrY0thXymfS55be0spEVX_WpWm2Yy3We8swMO0sIb3iD4Sg-i1lWnxSsiiN5JmWAD-/pub?gid=0&single=true&output=csv"
It seems like all the custom entries are considered malformed.
This is true for rse==0.0.45 also: I don't get the custom entries. I do for rse==0.0.44, though.
I think it's because the continue statement in the block that issues the warning about malformed entries prevents the inner loop from reaching the logic that creates custom entries? i.e., if this line
https://github.com/rseng/rse/blob/59820541a165de66e1a4fb5c7c32b64660979051/rse/main/scrapers/csv.py#L117
runs because repo is a CustomParser instance that returns an empty dict from get_metadata, then we never get to
https://github.com/rseng/rse/blob/59820541a165de66e1a4fb5c7c32b64660979051/rse/main/scrapers/csv.py#L121
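To illustrate the hypothesis, here is a stripped-down sketch of the control flow I think is happening. The class and function names (GitHubParser, CustomParser, create) are stand-ins I made up to mirror the shape of the loop, not rse's actual code:

```python
# Hypothetical reproduction of the suspected bug: when the custom
# parser yields an empty metadata dict, the "malformed" check fires
# and `continue` jumps past the code that would create the custom entry.
class GitHubParser:
    def get_metadata(self, uri):
        return {"uri": uri}  # assumed: GitHub repos yield metadata

class CustomParser:
    def get_metadata(self, uri):
        return {}  # assumed: non-GitHub URIs yield no metadata

def create(uris):
    created = []
    for uri in uris:
        is_github = uri.startswith("https://github.com")
        parser = GitHubParser() if is_github else CustomParser()
        metadata = parser.get_metadata(uri)
        if not metadata:
            print(f"Skipping malformed entry {uri}")
            continue  # <-- custom-entry creation below is never reached
        created.append(uri)  # stand-in for writing metadata.json
    return created
```

With this flow, create(["https://github.com/vocalpy/vak", "www.adobe.com"]) only ever produces the GitHub entry; the custom one is skipped with a warning, which matches what I see.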
Nice, you have a hypothesis! I think it looks like a bug - does it work when you comment it out?
Please see #84; it should fix the issues here! It was a bit more than the continue - a change in design that I didn't properly implement.
Description
Hi @vsoch thank you again for adding the ability to import .csv files directly.
I'm working on a script to clean up a .csv but I'm not actually able to figure out why the scraper reports some entries are malformed, instead of turning them into a "custom" repo. I think it has something to do with the format of the url?
Below I'll put output from running. Seems like anything that's not "https://" will fail.
Is that so? Is it in the docs somewhere that I need to have all urls be this format and I'm missing it? I don't find anything about the custom parser.
Sorry if I'm misunderstanding what the main loop in CSVImporter.create is doing.

What I Did
Here's the report of malformed entries.
Seems like anything that's not "https://" fails. E.g., "www.", "readthedocs.", etc.
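One way to see why scheme-less entries might behave differently: Python's standard urlparse treats a URL without "https://" as having no scheme and no hostname, putting the whole string into the path. This is just an illustration of the parsing behavior, not a claim about how the rse scraper actually checks URLs:

```python
from urllib.parse import urlparse

# Without a scheme, the host ends up in `path`, not `netloc`,
# so any check on the scheme or hostname comes up empty.
for uri in (
    "https://github.com/vocalpy/vak",
    "www.adobe.com/products/audition.html",
):
    parts = urlparse(uri)
    print(uri, "-> scheme:", repr(parts.scheme), "netloc:", repr(parts.netloc))
```

So if the scraper (or anything downstream) relies on the scheme or hostname, every "www." or "readthedocs." entry would look empty.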
Yes, I haven't set up a GitHub token yet.