jvanasco / metadata_parser

python library for getting metadata

Return all results of multiple meta tags with the same name #12

Closed sgelb closed 6 years ago

sgelb commented 6 years ago

For example, https://arxiv.org/abs/1704.04368v2 contains multiple meta tags named "citation_author", but page.get_metadata("citation_author", strategy=["meta"]) returns the last result only. Any plans for changing that behaviour?
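To see the issue concretely, here is a minimal sketch (using only Python's stdlib `html.parser`, not metadata_parser's own implementation) that collects every value of a repeated `<meta name=...>` tag instead of keeping only the last one. The author names are taken from the arXiv page mentioned above.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect the content of every <meta> tag with a given name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("name") == self.wanted:
            self.values.append(attrs.get("content"))

html = """
<head>
<meta name="citation_author" content="See, Abigail">
<meta name="citation_author" content="Liu, Peter J.">
<meta name="citation_author" content="Manning, Christopher D.">
</head>
"""

parser = MetaCollector("citation_author")
parser.feed(html)
print(parser.values)  # all three authors, not just the last one
```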

jvanasco commented 6 years ago

thanks for catching this.

I wasn't expecting an array of info there, and likely should have. the package is just parsing and storing the metadata into a k/v dict (https://github.com/jvanasco/metadata_parser/blob/master/metadata_parser/__init__.py#L1403-L1426) and trying to coerce duplicates into strings (there are a lot of poorly structured documents out there).
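The failure mode described above can be sketched in a few lines of plain Python (this is an illustration of the problem, not the library's actual storage code): writing each parsed tag into a flat k/v dict silently drops duplicates, while a dict of lists keeps every value.

```python
# Parsed (name, content) pairs as they appear in the document.
tags = [
    ("citation_title", "Get To The Point"),
    ("citation_author", "See, Abigail"),
    ("citation_author", "Liu, Peter J."),
]

# Flat k/v storage: each duplicate key overwrites the previous value.
flat = {}
for name, content in tags:
    flat[name] = content
# flat["citation_author"] is now only "Liu, Peter J."

# List-valued storage: every duplicate is preserved in document order.
multi = {}
for name, content in tags:
    multi.setdefault(name, []).append(content)
# multi["citation_author"] is ["See, Abigail", "Liu, Peter J."]
```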

import pprint
from metadata_parser import MetadataParser

url = 'https://arxiv.org/abs/1704.04368v2'
result = MetadataParser(url=url, search_head_only=False)
pprint.pprint(result.parsed_result.__dict__)

I'll definitely fix the library to store the full info in the parsed_result (though i'll have to dedupe it first).

i can't remember if get_metadata can ever return a string or an array in the present usage. i'm leaning towards not changing the behavior of get_metadata (it would always return a string) and introducing get_metadatas, which would always return an array. I need to dive into the tests and some of our internal use cases though.

jvanasco commented 6 years ago

ok. the master branch has a version that supports what you need via get_metadatas. that will always return an array, even if there is only one element. The values will be strings unless they come from DublinCore, in which case they will be dicts. the dc elements might have a tertiary lang or scheme attribute used to differentiate them, so a simple string won't work.

i added a bunch of tests to ensure we pull all the info in duplicate scenarios. there are a few edge cases with dc elements that I need to handle, but all the other types of metadata should be stable.

once i work out the dc stuff, I'll push this to PyPI as 0.9.19.

jvanasco commented 6 years ago

0.9.19 was released on PyPI last night. This is working fine on our production spider and a handful of internal libraries that use it to parse extra data.

I updated the README with some breaking changes, and am reproducing them below.


Version 0.9.19 Breaking Changes

Issue #12 exposed some flaws in the existing package.

1. MetadataParser.get_metadatas replaces MetadataParser.get_metadata

Until version 0.9.19, the recommended way to get metadata was to use get_metadata, which returns either a string or None.

Starting with version 0.9.19, the recommended way to get metadata is to use get_metadatas, which always returns either a list or None.

This change was made because the library incorrectly stored only a single value for a metadata key when the document contained duplicates.
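The new contract can be sketched with a hypothetical stand-in class (this is not the library's code; the real methods live on MetadataParser and its parsed_result): get_metadata keeps its string-or-None return, while get_metadatas always returns a list, even for a single value.

```python
class ResultSketch:
    """Illustrative stand-in for the 0.9.19 accessor contract."""

    def __init__(self, store):
        # store maps each metadata name to a list of values.
        self.store = store

    def get_metadata(self, name):
        # Legacy accessor: a single string, or None if absent.
        # (Which duplicate is returned is arbitrary in this sketch.)
        values = self.store.get(name)
        return values[0] if values else None

    def get_metadatas(self, name):
        # New accessor: always a list, or None if absent.
        return self.store.get(name) or None

r = ResultSketch({
    "citation_author": ["See, Abigail", "Liu, Peter J."],
    "og:title": ["Get To The Point"],
})
```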

2. The ParsedResult payload stores mixed content and tracks its version

Many users (including the maintainer) archive the parsed metadata. After testing a variety of payloads with an all-list format and a mixed format (string or list), a mixed format had a much smaller payload size with a negligible performance hit.

3. DublinCore payloads might be a dict

Tests were added to handle DublinCore data. An extra attribute may be needed to represent the payload, so always returning a dict with at least a name+content (and possibly 'lang' or 'scheme') is the best approach.
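A hypothetical example of the DublinCore payload shape described above (the entry values are made up for illustration): every dc entry is a dict with at least name and content, and optionally lang or scheme, which lets consumers filter on those attributes.

```python
# Hypothetical dc entries; 'lang' and 'scheme' appear only when present.
dc_entries = [
    {"name": "title", "content": "Get To The Point"},
    {"name": "title", "content": "Titre traduit", "lang": "fr"},
    {"name": "date", "content": "2017-04-14", "scheme": "W3CDTF"},
]

# Filtering on the optional attributes, e.g. titles without a lang tag:
untagged_titles = [
    entry for entry in dc_entries
    if entry["name"] == "title" and entry.get("lang") is None
]
```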