mediacloud/metadata-lib
How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0
12 stars
5 forks
issues
Update trafilatura requirement from ==1.4.* to >=1.4,<1.7
#51
dependabot[bot]
closed
1 year ago
0
update to latest version of trafilatura
#50
rahulbot
closed
1 year ago
1
Update trafilatura requirement from ==1.4.* to >=1.4,<1.6
#49
dependabot[bot]
closed
1 year ago
1
Update beautifulsoup4 requirement from ==4.11.* to >=4.11,<4.13
#48
dependabot[bot]
closed
1 year ago
0
fix bugs from PT integration
#47
rahulbot
closed
1 year ago
0
addressing no nk error
#46
pgulley
closed
1 year ago
1
Crash because uri.query.params['nk'] can be None
#45
vbanos
closed
1 year ago
2
Feature feed normalization
#44
rahulbot
closed
1 year ago
0
Add feed_url.py
#43
philbudne
closed
1 year ago
0
handle IP addresses better
#42
rahulbot
closed
1 year ago
1
Add a check to avoid TypeError
#41
vbanos
closed
1 year ago
1
Update htmldate requirement from ==1.3.* to >=1.3,<1.5
#40
dependabot[bot]
closed
2 years ago
0
Update trafilatura requirement from ==1.3.* to >=1.3,<1.5
#39
dependabot[bot]
closed
1 year ago
0
Update tldextract requirement from ==3.3.* to >=3.3,<3.5
#38
dependabot[bot]
closed
2 years ago
0
assess fasttext for language guessing speedup
#37
rahulbot
closed
11 months ago
1
upgrade dependencies
#36
rahulbot
closed
2 years ago
3
Fallback extractor
#35
pgulley
closed
2 years ago
0
handle empty content with no-encoding from HTML
#34
rahulbot
closed
1 year ago
0
Unexpected AttributeError on extract
#33
vbanos
closed
1 year ago
1
Improvement regarding content decoding/encoding
#32
vbanos
opened
2 years ago
1
Bug in extract method
#31
vbanos
closed
1 year ago
1
Use latest htmldate and pass datetime max_date instead of string
#30
vbanos
closed
2 years ago
0
add in top image and other metadata
#29
rahulbot
closed
2 years ago
2
More efficient parameterized unit tests
#28
vbanos
closed
2 years ago
1
optimization on tag removal in readability-lxml extraction fallback
#27
rahulbot
closed
2 years ago
0
improve trafilatura defaults
#26
rahulbot
closed
2 years ago
0
create larger test set to compare results to main system data
#25
rahulbot
closed
2 years ago
1
don't lowercase YouTube URLs for uniqueness hashing
#24
rahulbot
closed
2 years ago
0
limit dates in future?
#23
rahulbot
closed
2 years ago
2
Masking very frequent date parsing exceptions
#22
vbanos
closed
2 years ago
1
Unhandled exception we got in production
#21
vbanos
closed
2 years ago
4
centralize dependencies in one place
#20
rahulbot
closed
2 years ago
0
You could also compile these regex in this method.
#19
vbanos
closed
2 years ago
0
Use set instead of list for improved performance
#18
vbanos
closed
2 years ago
0
You could compile this regex for better performance
#17
vbanos
closed
2 years ago
0
Use Beautifulsoup4 with lxml parser for faster performance
#16
vbanos
closed
2 years ago
0
Add cchardet dependency to speedup BeautifulSoup4
#15
vbanos
closed
2 years ago
0
investigate URLs failing extraction
#14
rahulbot
closed
2 years ago
2
justify content extractor priorities with data and testing
#13
rahulbot
closed
1 year ago
3
Feature quick improvements
#12
rahulbot
closed
2 years ago
0
Stats for the success / failure of each extractor
#11
vbanos
closed
2 years ago
0
Improve exception handling
#10
vbanos
closed
2 years ago
1
Compile regular expressions to improve performance
#9
vbanos
closed
2 years ago
0
rename core branch from master to main
#8
rahulbot
closed
2 years ago
1
Prep for release to PyPI
#7
rahulbot
closed
2 years ago
2
Extract authors information when possible
#6
ibnesayeed
closed
1 year ago
3
Building and installing cld2-cffi is failing
#5
ibnesayeed
closed
2 years ago
2
Extracting original domain from archived pages
#4
ibnesayeed
closed
2 years ago
1
Exception on non-news article pages
#3
ibnesayeed
closed
2 years ago
0
switch language detection for now
#2
rahulbot
closed
2 years ago
0