proycon / codemeta-harvester

Harvest and aggregate codemeta/schema.org software metadata from source repositories and service endpoints, automatically converting from known metadata schemes in the process
GNU General Public License v3.0

Parsing Authors file #9

Open broeder-j opened 2 years ago

broeder-j commented 2 years ago

The AUTHORS file is currently parsed by codemeta-harvester. However, this file is pretty much free-format: people write free text in there and lists look completely different from project to project, so I think there is no way to parse names from this file reliably.

Currently one ends up with all kinds of wrong authors like:

[{'@type': 'Person', 'familyName': 'active developers:', 'givenName': 'Current', 'position': 1}, {'@type': 'Person', 'familyName': 'would also like to thank the following people for their contibution:', 'givenName': 'We', 'position': 5}, {'@type': 'Person', 'familyName': 'ackowledge discussions input from:', 'givenName': 'We', 'position': 8}]

This needs to be done in a smarter way. Currently I have no good idea other than to switch it off and get the authors from the citation file, and then from the git history, by default. Maybe contributors whose names are not found anywhere in a given AUTHORS file could be excluded, to filter out some strange contributors from the git history.
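The filtering idea in the last sentence can be sketched roughly as follows. This is only an illustration, not how codemeta-harvester works: `filter_contributors` and `normalise` are hypothetical helpers, and the matching is a plain case-insensitive substring test against the raw AUTHORS text, so the file never needs to be parsed.

```python
def normalise(text: str) -> str:
    """Case-fold and collapse all whitespace runs to single spaces."""
    return " ".join(text.casefold().split())


def filter_contributors(git_names, authors_text):
    """Keep only git contributors whose name appears in the AUTHORS text.

    Hypothetical helper: the AUTHORS file is used as an allow-list,
    without trying to extract names from it.
    """
    haystack = normalise(authors_text)
    kept = []
    for name in git_names:
        needle = normalise(name)
        # Skip empty names, which would trivially match any text.
        if needle and needle in haystack:
            kept.append(name)
    return kept


authors_text = """Current active developers:
  Jane  Doe
We would also like to thank the following people:
  john smith <js@example.org>
"""
git_names = ["Jane Doe", "dependabot[bot]", "John Smith", ""]
kept = filter_contributors(git_names, authors_text)
```

A substring test is deliberately crude: it misses people who commit under a nickname or a different spelling, so it would only make sense as an optional filter.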

proycon commented 2 years ago

Yes, this is indeed an issue, and the same goes for CONTRIBUTORS and MAINTAINERS. Codemetapy only supports a simple list format that is used often (see https://github.com/proycon/codemetapy/blob/master/codemeta/parsers/authors.py), and tries to be fairly flexible.

I'd rather not turn it off, but I think we need some extra validation, and we should ignore AUTHORS/CONTRIBUTORS/MAINTAINERS files that are too different from what we expect.
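One possible shape for such a validation step, as a rough sketch only: accept a file when most of its non-empty lines look like a name entry, and ignore it otherwise. The function names, the particle list and the 0.8 threshold are all assumptions made for illustration; none of this is taken from codemetapy.

```python
import re

# Assumed convention: an optional "<email>" part after the name.
ANGLE_PART = re.compile(r"<[^<>]*>")
# Hypothetical, incomplete list of lowercase name particles.
PARTICLES = {"van", "von", "de", "der", "den", "di", "da", "la", "le"}


def looks_like_name(line: str) -> bool:
    """Heuristic: 2-4 capitalised tokens, optionally with a list bullet."""
    line = ANGLE_PART.sub("", line).strip().lstrip("-*").strip()
    if not line or line.endswith(":"):
        return False  # empty, or a heading such as "Current active developers:"
    tokens = line.split()
    if not 2 <= len(tokens) <= 4:
        return False
    return all(t[0].isupper() or t.lower() in PARTICLES for t in tokens)


def plausible_authors_file(text: str, min_ratio: float = 0.8) -> bool:
    """Return True if at least min_ratio of non-empty lines look like names."""
    lines = [l for l in text.splitlines() if l.strip()]
    if not lines:
        return False
    hits = sum(looks_like_name(l) for l in lines)
    return hits / len(lines) >= min_ratio


simple_list = "Jane Doe\n- John Smith <js@example.org>\n"
free_text = (
    "Current active developers:\n"
    "Jane Doe\n"
    "We would also like to thank the following people for their contibution:\n"
    "John Smith\n"
)
simple_ok = plausible_authors_file(simple_list)
free_ok = plausible_authors_file(free_text)
```

With these inputs the plain list passes and the free-text file is rejected as a whole, so headings like the ones in the example above never reach the name parser. The threshold trades recall for precision and would need tuning against real repositories.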

broeder-j commented 2 years ago

This solution would also be fine, of course.