Closed sgelb closed 6 years ago
Thanks for catching this.

I wasn't expecting an array of info there, and likely should have. The package just parses and stores the metadata into a k/v dict (https://github.com/jvanasco/metadata_parser/blob/master/metadata_parser/__init__.py#L1403-L1426) and tries to coerce duplicates into strings (there are a lot of poorly structured documents out there).
```python
import pprint

from metadata_parser import MetadataParser

url = 'https://arxiv.org/abs/1704.04368v2'
result = MetadataParser(url=url, search_head_only=False)
pprint.pprint(result.parsed_result.__dict__)
```
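To illustrate why only one value survived, here is a minimal stdlib sketch (this is not the library's actual code, and `MetaCollector` is a hypothetical name): storing `<meta>` tags into a plain k/v dict is last-write-wins, so duplicate names silently overwrite each other unless you keep a list per name.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Hypothetical illustration: collect <meta name=... content=...> tags."""

    def __init__(self):
        super().__init__()
        self.flat = {}  # last-write-wins: duplicates are lost
        self.full = {}  # name -> list of every value seen

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        name, content = d.get("name"), d.get("content")
        if name is None or content is None:
            return
        self.flat[name] = content
        self.full.setdefault(name, []).append(content)

# The arXiv page above lists each author in its own meta tag:
html = """
<meta name="citation_author" content="See, Abigail">
<meta name="citation_author" content="Liu, Peter J.">
<meta name="citation_author" content="Manning, Christopher D.">
"""
p = MetaCollector()
p.feed(html)
print(p.flat["citation_author"])  # only the last author survives
print(p.full["citation_author"])  # all three are preserved
```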
I'll definitely fix the library to store the full info in the parsed_result (though I'll have to dedupe it first).
I can't remember if `get_metadata` can ever return a string or an array in its present usage. I'm leaning towards not changing the behavior of `get_metadata` (always return a string) and introducing `get_metadatas` to always return an array instead. I need to dive into the tests and some of our internal use cases though.
OK. The master branch has a version that supports what you need via `get_metadatas`. It will always return an array, even if there is only one element. The values will be strings unless they come from Dublin Core, in which case they will be dicts. The DC elements might have a tertiary `lang` or `scheme` attribute used to differentiate them, so a simple string won't work.
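The return shapes described above can be sketched as follows. This is an illustrative mock, not the library's implementation; the `store` dict and the helper below are assumptions standing in for a `parsed_result`:

```python
# A mock parsed store: plain metadata values are strings (or lists when
# duplicated); Dublin Core entries are dicts that may carry a `lang` or
# `scheme` attribute alongside name + content.
store = {
    "citation_author": ["See, Abigail", "Liu, Peter J."],
    "og:title": "Example Title",
    "dc": [
        {"name": "Subject", "content": "summarization", "lang": "en"},
        {"name": "Date", "content": "2017-04-14", "scheme": "W3CDTF"},
    ],
}

def get_metadatas(store, field):
    """Sketch of a get_metadatas-style accessor: always a list, or None."""
    value = store.get(field)
    if value is None:
        return None
    return value if isinstance(value, list) else [value]

print(get_metadatas(store, "og:title"))         # ['Example Title']
print(get_metadatas(store, "citation_author"))  # two-element list
print(get_metadatas(store, "missing"))          # None
```

Wrapping single values on the way out keeps callers to one code path: they can always iterate the result without type checks.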
I added a bunch of tests to ensure we pull all the info in duplicate scenarios. There are a few edge cases with DC elements that I need to handle, but all the other types of metadata should be stable.
Once I work out the DC stuff, I'll push this to PyPI as `0.9.19`.
`0.9.19` was released on PyPI last night. This is working fine on our production spider and in a handful of internal libraries that use it to parse extra data.
I updated the README with some breaking changes, and am reproducing them below.
**Version 0.9.19 Breaking Changes**

Issue #12 exposed some flaws in the existing package.
Until version 0.9.19, the recommended way to get metadata was to use `get_metadata`, which returns either a string or None.

Starting with version 0.9.19, the recommended way to get metadata is to use `get_metadatas`, which always returns a list or None.

This change was made because the library incorrectly stored only a single value for a metadata key when duplicates were encountered.
Many users (including the maintainer) archive the parsed metadata. After testing a variety of payloads with an all-list format and a mixed format (string or list), the mixed format had a much smaller payload size with a negligible performance hit.
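The size trade-off can be seen with a quick comparison. This is a rough sketch with made-up sample data, not the maintainer's benchmark: serializing every value as a list costs two extra bytes (`[` and `]`) per single-valued key, while a mixed format only pays that cost for keys with real duplicates.

```python
import json

# Made-up parsed metadata: mostly single values, one real duplicate.
parsed = {
    "og:title": "Example",
    "og:type": "article",
    "description": "An example page",
    "citation_author": ["A. See", "P. Liu", "C. Manning"],
}

# All-list variant: wrap every single value in a list.
all_list = {k: v if isinstance(v, list) else [v] for k, v in parsed.items()}

mixed_size = len(json.dumps(parsed))
list_size = len(json.dumps(all_list))
print(mixed_size, list_size)  # the mixed payload is smaller
```

With three single-valued keys here, the all-list payload is exactly six bytes larger; at archive scale, across millions of mostly single-valued records, that overhead adds up.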
Tests were added to handle Dublin Core data. An extra attribute may be needed to represent the payload, so always returning a dict with at least a name + content (and possibly 'lang' or 'scheme') is the best approach.
For example, https://arxiv.org/abs/1704.04368v2 contains multiple meta tags named "citation_author", but `page.get_metadata("citation_author", strategy=["meta"])` returns only the last result. Any plans for changing that behaviour?