mediacloud/metadata-lib
How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0
12 stars, 5 forks
issues
Add option to override canonical_domain
#101
vbanos
closed
1 day ago
6
Add version constraint caution for Trafilatura
#100
m453h
closed
1 month ago
0
Add trafilatura version warning to pyproject.yaml?
#99
philbudne
closed
1 month ago
1
Update Authors in pyproject.toml
#98
pgulley
opened
1 month ago
0
Add black formatter to pre-commit hooks
#97
m453h
closed
1 month ago
1
Enable black formatter in Pre-Commit Hooks
#96
m453h
closed
1 month ago
1
Revision Needed: Canonical URL extraction method will break with future versions of trafilatura
#95
m453h
closed
1 month ago
2
Tests fail due to network errors
#94
pgulley
opened
1 month ago
0
Add function urls.is_non_news_domain
#93
philbudne
closed
1 month ago
0
Enable extraction of canonical link information
#92
m453h
closed
1 month ago
6
Add function to detect non-news URLs?
#91
philbudne
closed
1 month ago
0
Add requests_arcana.py from story-indexer, for use across mediacloud projects
#90
philbudne
closed
1 month ago
0
Include "canonical link" information in mcmetadata.extract if present.
#89
philbudne
closed
1 month ago
3
Add routine to return configured requests.Session object
#88
philbudne
closed
1 month ago
0
MC metadata extraction investigation
#87
pgulley
closed
4 months ago
0
Assess tweaks to content extraction to remove headlines at end of article
#86
rahulbot
opened
7 months ago
2
Update htmldate requirement from ==1.7.* to >=1.7,<1.9
#85
dependabot[bot]
closed
8 months ago
1
Update trafilatura requirement from <1.7,>=1.4 to >=1.4,<1.9
#84
dependabot[bot]
closed
8 months ago
1
Further tweaking of User-Agent string?
#83
philbudne
closed
9 months ago
3
central storage for User-Agent to use across MC projects
#82
rahulbot
closed
9 months ago
1
store MC user-agent for use by our other libraries
#81
rahulbot
closed
9 months ago
0
Not capturing full article text
#80
jaypinho
closed
9 months ago
1
Update trafilatura requirement from <1.7,>=1.4 to >=1.4,<1.8
#79
dependabot[bot]
closed
8 months ago
1
Get automated release working
#78
rahulbot
closed
6 months ago
2
ignore ports & handle IP domains in `normalize_url`
#77
rahulbot
closed
10 months ago
0
Update requirements
#76
rahulbot
closed
10 months ago
1
Update htmldate requirement from ==1.6.* to >=1.6,<1.8
#75
dependabot[bot]
closed
10 months ago
2
Fix title parsing failure (due to empty or whitespace title tag)
#74
rahulbot
closed
10 months ago
1
mcmetadata.extract throwing AttributeErrors
#73
philbudne
closed
10 months ago
3
possible url normalization issues
#72
philbudne
opened
11 months ago
1
Update static test fixtures
#71
rahulbot
closed
11 months ago
0
centralize url unique hash generation with helper method in this package
#70
rahulbot
closed
11 months ago
1
improve CI test run reliability by using cached fixtures?
#69
rahulbot
closed
11 months ago
0
allow capturing stats from individual extract calls
#68
rahulbot
closed
1 year ago
0
May want to remove story source related query parameters!
#67
philbudne
closed
1 year ago
1
update requirements file to latest
#66
rahulbot
closed
1 year ago
0
Small tweaks to handle whitespace in URLs
#65
rahulbot
closed
1 year ago
0
Support defaults and overrides in `extract`
#64
rahulbot
closed
1 year ago
0
support passing in a fallback publication date
#63
rahulbot
closed
1 year ago
2
Update htmldate requirement from ==1.5.* to >=1.5,<1.7
#62
dependabot[bot]
closed
1 year ago
2
Discuss possible enhancements to mcmetadata.extract
#61
philbudne
closed
1 year ago
2
Update dateparser requirement from ==1.1.* to >=1.1,<1.3
#60
dependabot[bot]
closed
1 year ago
2
Update tldextract requirement from ==3.6.* to >=3.6,<5.2
#59
dependabot[bot]
closed
1 year ago
2
Handling of URL parse failure
#58
philbudne
closed
1 year ago
0
Update tldextract requirement from ==3.6.* to >=3.6,<5.1
#57
dependabot[bot]
closed
1 year ago
1
Update tldextract requirement from ==3.4.* to >=3.4,<3.7
#56
dependabot[bot]
closed
1 year ago
1
Update tldextract requirement from ==3.4.* to >=3.4,<3.6
#55
dependabot[bot]
closed
1 year ago
1
Update htmldate requirement from ==1.4.* to >=1.4,<1.6
#54
dependabot[bot]
closed
1 year ago
1
Switched from cchardet to faust-chardet, as the former is unmaintained…
#53
pgulley
closed
1 year ago
0
mcmetadata not type checked by mypy
#52
philbudne
closed
1 year ago
2