radiolarian / AO3Scraper

A Python scraper for getting fan fiction content and metadata from Archive of Our Own.

Issue with extract_metadata #13

Closed: thaumivore closed this issue 3 years ago

thaumivore commented 3 years ago

In advance, I'm really sorry if this is a dumb question, I only learned Python about two weeks ago.

I'm having this issue when I run extract_metadata.py on the output of ao3_get_fanfics.py:

extract_metadata.py:26: DeprecationWarning: 'U' mode is deprecated
  with open(csv_name, 'rU') as csvfile:
Traceback (most recent call last):
  File "extract_metadata.py", line 36, in <module>
    main()
  File "extract_metadata.py", line 31, in main
    work_id = row[0]
IndexError: list index out of range

None of the solutions I've tried so far have worked. Do you have any insight?

ssterman commented 3 years ago

Can you share the file you are trying to parse? It looks like the csv parser hit an empty row: "IndexError: list index out of range" on row[0] means the row has no column 0.

thaumivore commented 3 years ago

The csv is in the zipped file below. It's the output of running ao3_get_fanfics.py.

fanfics.zip

ssterman commented 3 years ago

Thanks. The problem you saw should be fixed by https://github.com/radiolarian/AO3Scraper/commit/af61b898f2b2a566d3a81870f67185fc9f9a7f42. It was due to misformatted empty lines in the csv, but the script should be robust to that now.
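For anyone who hits the same traceback, here is a minimal sketch of the kind of guard that avoids it. This is not the code from the commit above, only an illustration: the helper name iter_nonempty_rows and the sample data are hypothetical. Python's csv.reader returns an empty list for a blank line, so indexing row[0] on it raises the IndexError shown earlier.

```python
import csv
import io


def iter_nonempty_rows(csvfile):
    """Yield only the rows that contain at least one field.

    csv.reader returns [] for a blank line, so row[0] would raise
    IndexError on those rows unless they are skipped.
    """
    for row in csv.reader(csvfile):
        if not row:
            continue  # blank line in the csv: nothing to index
        yield row


# Hypothetical sample with a blank line between two records.
# For a real file, the csv docs recommend open(csv_name, 'r', newline='')
# instead of the deprecated 'rU' mode.
sample = io.StringIO("123,Title,body text\r\n\r\n456,Other,more text\r\n")
work_ids = [row[0] for row in iter_nonempty_rows(sample)]
print(work_ids)
```

Skipping blank rows keeps the rest of the loop unchanged, which is usually simpler than cleaning the csv by hand.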

Quick second point: the order of some new fields in the scraper was incorrect. Unfortunately, this means extract_metadata will not give you the right content, since your csv is in the old (bad) order. You have two options:
1) Rescrape, which might be a pain, sorry.
2) The faster solution, which is not compatible with future scrapes: on line 34 of extract_metadata.py, change wr.writerow(row[:-1]) to wr.writerow(row[:-3]). This drops the last three columns, which removes the body text from your csv (it was third from the end in the old ordering) along with the two fields after it. For future scrapes, which use the new (fixed) order, change it back to -1.
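To make the slicing concrete, here is a small sketch using made-up rows. The field names are placeholders, not the scraper's real columns; the only thing taken from the thread is that the body was third from the end in the old order and last in the new one. The last line shows an alternative that removes only the body from an old-order row, which is an assumption on my part and not something extract_metadata.py does.

```python
# Hypothetical rows; the field names are placeholders, not the scraper's real columns.
old_order_row = ["123", "Some Title", "BODY TEXT", "field_a", "field_b"]
new_order_row = ["123", "Some Title", "field_a", "field_b", "BODY TEXT"]

# Old order: row[:-3] drops the body and the two fields that follow it.
old_trimmed = old_order_row[:-3]

# New order: the body is last, so row[:-1] drops only the body.
new_trimmed = new_order_row[:-1]

# Alternative for old-order rows (an assumption, not the script's behavior):
# remove just the third-from-last field and keep the last two.
old_body_only_removed = old_order_row[:-3] + old_order_row[-2:]

print(old_trimmed)
print(new_trimmed)
print(old_body_only_removed)
```

Whether the two trailing fields matter depends on what you need from the metadata; if they do, rescraping is the safer route.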

If you have any questions let me know. Hope that helps.