Parsing Authors file

broeder-j commented 2 years ago

The AUTHORS file is currently parsed by codemeta-harvester. However, this file is pretty much free format. People write free text in there and lists look completely different from project to project, so I think there is no way one can reliably parse names from this file.

Currently one ends up with all kinds of wrong authors like:

[{'@type': 'Person', 'familyName': 'active developers:', 'givenName': 'Current', 'position': 1}, {'@type': 'Person', 'familyName': 'would also like to thank the following people for their contibution:', 'givenName': 'We', 'position': 5}, {'@type': 'Person', 'familyName': 'ackowledge discussions input from:', 'givenName': 'We', 'position': 8}]

This needs to be done smarter. Currently I have no good idea except to switch it off and, by default, get the authors from the Citation file and then from the git history. Maybe exclude people whose names are not found anywhere in a given AUTHORS file, to filter out some strange contributors from the git history.
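A minimal sketch of that last filtering idea, assuming the contributor names have already been extracted from the git history and the AUTHORS file has been read as plain text. The function name `filter_contributors` and its parameters are hypothetical and not part of codemeta-harvester or codemetapy.

```python
def filter_contributors(git_names, authors_text):
    """Keep only git contributors whose full name occurs in the AUTHORS text.

    Hypothetical helper: matching is a case-insensitive substring test after
    collapsing whitespace, so free text around the names does not matter.
    """
    # Normalise the AUTHORS text once: lowercase, single spaces.
    haystack = " ".join(authors_text.lower().split())
    kept = []
    for name in git_names:
        needle = " ".join(name.lower().split())
        # Skip empty names; keep the original spelling from git.
        if needle and needle in haystack:
            kept.append(name)
    return kept


authors_text = """Current active developers:
* Jane Doe <jane@example.org>
* John  Smith
"""
print(filter_contributors(["Jane Doe", "dependabot[bot]", "john smith"], authors_text))
```

Substring matching is deliberately loose here: it never tries to decide which lines of the AUTHORS file are names, it only uses the file as a whitelist. Very short names could still match by accident, so a stricter comparison may be needed in practice.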

proycon commented 2 years ago

Yes, this is indeed an issue, the same goes for CONTRIBUTORS and MAINTAINERS. Codemetapy only supports a certain simple list format that is used often (see https://github.com/proycon/codemetapy/blob/master/codemeta/parsers/authors.py), and tries to be fairly flexible.

I'd rather not turn it off, but I think we need some extra validation and should ignore AUTHORS/CONTRIBUTORS/MAINTAINERS files that are too different from what we expect.
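One way such a validation could look, as a rough sketch only: count how many non-empty lines look like a plain name entry and ignore the whole file if the share is too low. The helpers `looks_like_name` and `plausible_authors_file`, the regular expressions, and the thresholds `max_words` and `min_ratio` are all assumptions for illustration, not the existing codemetapy parser.

```python
import re

# Hypothetical patterns: a leading list bullet, and "<email>" or "(affiliation)" parts.
BULLET = re.compile(r"^\s*[-*+]\s*")
EXTRAS = re.compile(r"<[^>]*>|\([^)]*\)")
# A word made of letters, optionally joined by '.', "'" or '-' (e.g. "J.", "O'Brien").
NAME_WORD = re.compile(r"[^\W\d_]+(?:[.'\-][^\W\d_]*)*")


def looks_like_name(line, max_words=4):
    """Return True if the line plausibly holds a single person's name."""
    text = EXTRAS.sub("", BULLET.sub("", line)).strip()
    # Headings such as "Current active developers:" end with a colon.
    if not text or text.endswith(":"):
        return False
    words = text.split()
    if len(words) > max_words:
        return False
    return all(NAME_WORD.fullmatch(word) for word in words)


def plausible_authors_file(text, min_ratio=0.8):
    """Return True if at least min_ratio of the non-empty lines look like names."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    hits = sum(1 for line in lines if looks_like_name(line))
    return hits / len(lines) >= min_ratio


good = "Jane Doe <jane@example.org>\n* John Smith\n- Anne-Marie O'Brien (Some University)\n"
bad = "Current active developers:\nJane Doe\nWe would also like to thank the following people:\nJohn Smith\n"
print(plausible_authors_file(good), plausible_authors_file(bad))
```

A whole-file threshold keeps the parser enabled for the common simple list format while skipping free-text files entirely. Filtering line by line with `looks_like_name` would be a softer alternative that still picks up the names in mixed files.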

broeder-j commented 2 years ago

This solution would also be fine, of course.

proycon / codemeta-harvester

Parsing Authors file #9