swh / classification-gold-standard

Gold standard for the evaluation of machine classification of patent data
BSD 3-Clause "New" or "Revised" License

datasets annotation #1

Open sofean-mso opened 4 years ago

sofean-mso commented 4 years ago

Hello, is there any information on how you annotated these datasets? From which patent databases did you extract them? Is the full text of each document available, or only the title?

Thank you

cwfparsonson commented 1 year ago

I have the same question.

How was this dataset curated? Many of the titles are cut short, do not make sense on their own, or do not read like titles at all. E.g.: 'Can be used for such as quantum computing which is used for solving the problem that the system and method for'.

Furthermore, many of the ground-truth 'positive' and 'negative' labels assigned by the expert writing the paper do not seem to make sense. For example, 'A modular array of vertically integrated superconducting qubit - units for scalable quanta data processing' is classed as 'negative' even though it seems highly relevant to hardware quantum qubits, while 'SOLID STATE MATERIAL' and 'SINGLE CRYSTAL CVD DIAMOND AND DEVICES' are classed as 'positive' even though they seemingly have nothing to do with quantum qubit generation.