rom1504
/
cc2dataset
Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...
MIT License
307
stars
23
forks
issues
Vplatform v2
#49
rom1504
opened
11 months ago
0
Warc support
#48
marianna13
opened
1 year ago
0
investigate if computing count instead of drop duplicates would be fast
#47
rom1504
opened
1 year ago
0
adapt number of output files based on document type
#46
rom1504
opened
1 year ago
1
Improve spark builder
#45
rom1504
closed
1 year ago
0
Add image_only document type.
#44
rom1504
closed
1 year ago
1
Investigate implementation of url / metadata predictors
#43
rom1504
opened
1 year ago
1
Implement relative links while keeping validation functions pure
#42
rom1504
closed
1 year ago
3
Save warc filename & URL of webpage
#41
marianna13
closed
1 year ago
2
Resolve relative links to absolute URLs
#40
sebastian-nagel
closed
1 year ago
2
Extract robots metatags
#39
sebastian-nagel
opened
1 year ago
0
Revamp cc2dataset warc text extraction
#38
harry-stark
opened
1 year ago
0
Add diagram of the advised processing pipeline
#37
rom1504
opened
1 year ago
0
add video platform
#36
rom1504
opened
1 year ago
1
advise on stage 2
#35
rom1504
opened
1 year ago
0
consider expanding to WARC
#34
rom1504
opened
1 year ago
2
Add more text document type
#33
rom1504
closed
1 year ago
0
fix test
#32
rom1504
closed
1 year ago
2
check structured CC extraction
#31
rom1504
opened
1 year ago
2
Add thanks section
#30
rom1504
closed
1 year ago
0
support audio platform
#29
rom1504
opened
1 year ago
0
get rid of useless spark warnings / improve speed logging
#28
rom1504
opened
1 year ago
0
support video platform
#27
rom1504
opened
1 year ago
10
support video
#26
rom1504
closed
1 year ago
1
support text document_type
#25
rom1504
closed
1 year ago
1
Add document type param + tests
#24
rom1504
closed
1 year ago
0
Implement restarting the spark app every part
#23
rom1504
closed
1 year ago
3
add some options to make it possible to get other stuff than images
#22
rom1504
closed
1 year ago
4
would doing some parallelism when retrieving shards make things faster?
#21
rom1504
closed
1 year ago
3
Consider optionally moving dedup and shuffle to a second step
#20
rom1504
closed
1 year ago
2
Rename to cc2dataset?
#19
rom1504
closed
1 year ago
3
Partition the final merge + shuffle
#18
rom1504
closed
1 year ago
13
consider making dedup optional if local disk limited but remote is not
#17
rom1504
closed
1 year ago
1
shuffle
#16
rom1504
closed
1 year ago
2
add date to output folder
#15
rom1504
closed
1 year ago
1
Investigate using parquet bloom filter to reduce size on disk
#14
rom1504
closed
1 year ago
10
more references
#13
rom1504
closed
1 year ago
0
save input wat list at beginning
#12
rom1504
closed
1 year ago
1
Implement multipart.
#11
rom1504
closed
1 year ago
0
implement multi steps processing to limit disk space need
#10
rom1504
closed
1 year ago
0
Use yield to combine main parsing and dedup.
#9
rom1504
closed
1 year ago
0
put multiple warcs per output file
#8
rom1504
closed
1 year ago
1
wip better s3a
#7
rom1504
closed
1 year ago
1
faster write/read to s3 for dedup
#6
rom1504
closed
1 year ago
12
some numbers
#5
rom1504
closed
1 year ago
5
pad file names with 0
#4
rom1504
closed
1 year ago
1
udf + s3a write
#3
rom1504
closed
1 year ago
3
pandas udf and dedup
#2
rom1504
closed
1 year ago
2
wip
#1
rom1504
closed
1 year ago
0