pgcorpus/gutenberg
Pipeline to generate the Standardized Project Gutenberg Corpus
https://zenodo.org/record/2422561
GNU General Public License v3.0
158 stars
38 forks
issues
Rsync Error
#50
AlanikREDAWN
closed
1 month ago
0
Is this repo still actively maintained?
#49
d-kleine
opened
2 months ago
0
skipped due to duplication
#48
Felix-liu0989
opened
8 months ago
0
Fixed typos, an oversight regarding nltk data download, and added support for multi-threading/processing, Windows, ignoring UTF-8 decoding failures, etc.
#47
Trenza1ore
opened
10 months ago
0
Fix: Only download missing files with rsync
#46
YertleTurtleGit
opened
10 months ago
0
no data stored in bookshelves_ebooks_dict.pkl and bookshelves_categories_dict.pkl after successful running
#45
kaapivalli
opened
1 year ago
3
Storing raw data in a compressed format
#44
PadLex
opened
1 year ago
0
include size of processed corpus in README
#43
erikfredner
opened
1 year ago
3
Paths
#42
alex-raw
opened
2 years ago
0
feat(get_data): fallback to 'ibiblio' and add server argument
#41
alex-raw
opened
2 years ago
0
"Connection refused"
#40
danielplatt
opened
2 years ago
1
Bug fix on BS wget request, changed BS dump to dicts
#39
gabriele-di-bona
closed
2 years ago
1
Bookshelves
#38
nofreewill42
closed
2 years ago
2
Not windows-friendly things
#37
fontclos
opened
4 years ago
2
pandas
#36
iandoug
closed
4 years ago
3
metadata error handling and alternative URL for RDF files
#35
fontclos
closed
4 years ago
0
File not found on Windows 10
#34
luigiusai
closed
4 years ago
3
indicate in README that SPGC-2018-07-18 doesn't contain full texts
#33
bpshaver
closed
4 years ago
1
get_data.py fails: ReadError
#32
maxbry
closed
4 years ago
5
Allow for retrieving epubs files?
#31
hneutr
closed
4 years ago
1
rsync command fails on Windows 10
#30
andreluizgit
closed
4 years ago
13
Getting info about the data before download
#29
edilsonacjr
opened
5 years ago
6
Fix locale utf8
#28
fontclos
closed
5 years ago
0
Processing fails when locale.getpreferredencoding() does not return UTF-8
#27
fontclos
closed
5 years ago
0
passing UTF-8 encoding explicitly when opening html files
#26
fontclos
closed
5 years ago
0
parse_bookshelves() fails due to encoding issue
#25
fontclos
closed
5 years ago
1
"Copyright Renewal" text
#24
martingerlach
closed
5 years ago
2
added gnu license
#23
fontclos
closed
5 years ago
0
Add a LICENSE
#22
fontclos
closed
5 years ago
0
python get_data: 'metadata/bookshelves' is not a directory
#21
martingerlach
closed
6 years ago
5
remove notebooks and all jupyter stuff
#20
fontclos
closed
6 years ago
1
Simplify requirements files
#19
fontclos
closed
6 years ago
2
Recode bookshelves
#18
fontclos
closed
6 years ago
1
Parse bookshelves
#17
fontclos
closed
6 years ago
0
Bookshelves metadata is not automatically generated
#16
fontclos
closed
6 years ago
4
Add dummy files
#15
fontclos
closed
6 years ago
0
Metadata contains books that are not in data/
#14
martingerlach
closed
6 years ago
1
Log info
#13
fontclos
closed
6 years ago
0
Choosing right tokenizer
#12
fontclos
closed
6 years ago
0
Duplicates detection
#11
fontclos
closed
6 years ago
1
Create lists of counts
#10
fontclos
closed
6 years ago
1
Getting error when running `python get_data.py`
#9
fontclos
closed
6 years ago
3
added missing final newline in counts and tokens files writers
#8
fontclos
closed
7 years ago
0
added nltk data
#7
fontclos
closed
7 years ago
0
NLTK tokenizer always using english trained model
#6
fontclos
closed
6 years ago
3
NLTK tokenizer missing on fresh run
#5
fontclos
closed
7 years ago
2
Fran
#4
fontclos
closed
7 years ago
0
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 12, column 65
#3
fontclos
closed
6 years ago
1
Missing newline at the end of counts files
#2
fontclos
closed
7 years ago
1
ValueError: The specified mirror directory does not exist when running 'python get_data.py'
#1
martingerlach
closed
6 years ago
5