pgcorpus gutenberg issues

pgcorpus / gutenberg

Pipeline to generate the Standardized Project Gutenberg Corpus

https://zenodo.org/record/2422561

GNU General Public License v3.0

158 stars, 38 forks

issues

Rsync Error

#50 AlanikREDAWN closed 1 month ago
0
Is this repo still actively maintained?

#49 d-kleine opened 2 months ago
0
skipped due to duplication

#48 Felix-liu0989 opened 8 months ago
0
Fixed typos, an oversight regarding nltk data download, and added support for multi-threading/processing, Windows, ignoring UTF-8 decoding failures, etc.

#47 Trenza1ore opened 10 months ago
0
Fix: Only download missing files with rsync

#46 YertleTurtleGit opened 10 months ago
0
no data stored in bookshelves_ebooks_dict.pkl and bookshelves_categories_dict.pkl after successful running

#45 kaapivalli opened 1 year ago
3
Storing raw data in a compressed format

#44 PadLex opened 1 year ago
0
include size of processed corpus in README

#43 erikfredner opened 1 year ago
3
Paths

#42 alex-raw opened 2 years ago
0
feat(get_data): fallback to 'ibiblio' and add server argument

#41 alex-raw opened 2 years ago
0
"Connection refused"

#40 danielplatt opened 2 years ago
1
Bug fix on BS wget request, changed BS dump to dicts

#39 gabriele-di-bona closed 2 years ago
1
Bookshelves

#38 nofreewill42 closed 2 years ago
2
Not Windows-friendly things

#37 fontclos opened 4 years ago
2
pandas

#36 iandoug closed 4 years ago
3
metadata error handling and alternative URL for RDF files

#35 fontclos closed 4 years ago
0
File not found on Windows 10

#34 luigiusai closed 4 years ago
3
indicate in README that SPGC-2018-07-18 doesn't contain full texts

#33 bpshaver closed 4 years ago
1
get_data.py fails: ReadError

#32 maxbry closed 4 years ago
5
Allow for retrieving epubs files?

#31 hneutr closed 4 years ago
1
rsync command fails on Windows 10

#30 andreluizgit closed 4 years ago
13
Getting info about the data before download

#29 edilsonacjr opened 5 years ago
6
Fix locale utf8

#28 fontclos closed 5 years ago
0
Processing fails when locale.getpreferredencoding() does not return UTF-8

#27 fontclos closed 5 years ago
0
passing UTF-8 encoding explicitly when opening html files

#26 fontclos closed 5 years ago
0
parse_bookshelves() fails due to encoding issue

#25 fontclos closed 5 years ago
1
"Copyright Renewal" text

#24 martingerlach closed 5 years ago
2
added gnu license

#23 fontclos closed 5 years ago
0
Add a LICENSE

#22 fontclos closed 5 years ago
0
python get_data: 'metadata/bookshelves' is not a directory

#21 martingerlach closed 6 years ago
5
remove notebooks and all jupyter stuff

#20 fontclos closed 6 years ago
1
Simplify requirements files

#19 fontclos closed 6 years ago
2
Recode bookshelves

#18 fontclos closed 6 years ago
1
Parse bookshelves

#17 fontclos closed 6 years ago
0
Bookshelves metadata is not automatically generated

#16 fontclos closed 6 years ago
4
Add dummy files

#15 fontclos closed 6 years ago
0
Metadata contains books that are not in data/

#14 martingerlach closed 6 years ago
1
Log info

#13 fontclos closed 6 years ago
0
Choosing right tokenizer

#12 fontclos closed 6 years ago
0
Duplicates detection

#11 fontclos closed 6 years ago
1
Create lists of counts

#10 fontclos closed 6 years ago
1
Getting error when running `python get_data.py`

#9 fontclos closed 6 years ago
3
added missing final newline in counts and tokens files writers

#8 fontclos closed 7 years ago
0
added nltk data

#7 fontclos closed 7 years ago
0
NLTK tokenizer always using English trained model

#6 fontclos closed 6 years ago
3
NLTK tokenizer missing on fresh run

#5 fontclos closed 7 years ago
2
Fran

#4 fontclos closed 7 years ago
0
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 12, column 65

#3 fontclos closed 6 years ago
1
Missing newline at the end of counts files

#2 fontclos closed 7 years ago
1
ValueError: The specified mirror directory does not exist when running 'python get_data.py'

#1 martingerlach closed 6 years ago
5