Issues for aligning mention and context strings

kermitt2 commented 6 years ago

It's not always easy to match software_name and quote (as present in the .csv files).

After applying some basic soft matching, I still have 138 distinct software names that match neither the provided quote nor the actual PDF content (out of 5684 software names present in the .csv file, if I am not wrong). Usually this is due to a typo or misspelling, for instance:

provided software name: CLUSTER W2 actual context: "and aligned using CLUSTAL W2 software (www.ebi.ac.uk/Tools/msa/clustalw2/) (Fig. 2)."
provided software name: Integrative Genome Viewer actual context: "All candidate variants were visually inspected in Integrative Genomics Viewer."

Sometimes, it is a bit more complicated, for instance the software name here:

provided software name: Electronic Health Records (EHR) actual context: "Until recently thetopics of EHR and ‘Consumer Health Infor-matics’ tended to be considered separately; dis-cussions of the former being mainly centred onhealthcare institutions"

It is easy to fix problems related to casing and missing spaces, but mismatches like the ones above are hard to recover automatically.
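For reference, the kind of soft matching meant here can be sketched as below. This is only an illustration, not the script actually used: the function names `soft_key` and `soft_match` are hypothetical, and the sketch assumes that ignoring case and all whitespace is acceptable for the comparison.

```python
import re


def soft_key(text: str) -> str:
    """Drop all whitespace and fold case so spacing/casing differences vanish."""
    return re.sub(r"\s+", "", text).casefold()


def soft_match(software_name: str, quote: str) -> bool:
    """True if the name occurs in the quote once case and whitespace are ignored."""
    return soft_key(software_name) in soft_key(quote)


quote = "and aligned using CLUSTAL W2 software (www.ebi.ac.uk/Tools/msa/clustalw2/) (Fig. 2)."
print(soft_match("ClustalW2", quote))   # casing and missing space are tolerated
print(soft_match("CLUSTER W2", quote))  # a real misspelling still fails
```

A misspelled name such as CLUSTER W2 still fails this check, which is why the remaining cases need manual review.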

Here is a file with the 138 problematic software names and their context mention. Note that there are many missing quotes in the .csv file and some missing software names which have not been taken into account in this list (I only used the .csv files so far).

matching_issues_mention_context.txt

jameshowison commented 6 years ago

Right. Our instructions are to copy the software name from the text, but it looks like that isn't always happening. We should re-emphasize this and add a check for it to our tests. For existing work, we could have students work through this file and make the fixes.
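To speed up that manual pass, a candidate correction could be suggested for each typo-style mismatch. The following is a rough sketch only, assuming the correct name appears in the quote as a run of whitespace-separated tokens of about the same length; `closest_span` is a hypothetical helper built on the standard library's `difflib`, not part of the existing tooling.

```python
from difflib import SequenceMatcher


def closest_span(software_name: str, quote: str) -> tuple[str, float]:
    """Return the token window of the quote most similar to the name, with its score."""
    quote_tokens = quote.split()
    n = len(software_name.split())
    best_span, best_score = "", 0.0
    # Try windows one token shorter, equal, and one token longer than the name.
    for size in (n - 1, n, n + 1):
        if size < 1:
            continue
        for i in range(len(quote_tokens) - size + 1):
            span = " ".join(quote_tokens[i:i + size])
            score = SequenceMatcher(None, software_name.lower(), span.lower()).ratio()
            if score > best_score:
                best_span, best_score = span, score
    return best_span, best_score


quote = "and aligned using CLUSTAL W2 software (www.ebi.ac.uk/Tools/msa/clustalw2/) (Fig. 2)."
print(closest_span("CLUSTER W2", quote))
```

The suggestion would still need a human to confirm it, since a high similarity score does not guarantee the span is the intended software name.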

kermitt2 commented 6 years ago

With the new .csv files (and ignoring duplicate identifiers), here is an updated version of the list of mention/context mismatches for the five annotation types (software name, version date, version number, creator and url).

unmatched-creator-mention-context.txt unmatched-software-mention-context.txt unmatched-url-mention-context.txt unmatched-version-date-mention-context.txt unmatched-version-number-mention-context.txt

I obtain these ratios:

Unmatched software mentions: 51 out of 5047 total software mentions (1.01%)
Unmatched version number mentions: 26 out of 1682 total version number mentions (1.55%)
Unmatched version date mentions: 14 out of 162 total version date mentions (8.64%)
Unmatched creator mentions: 50 out of 1709 total creator mentions (2.93%)
Unmatched url mentions: 17 out of 282 total url mentions (6.03%)

Total unmatched mentions: 158 out of 5121 total mentions (3.09%)

I think fixing these issues first would be nice because:

from the machine learning perspective, every mention that cannot be aligned with its context becomes a wrongly unannotated example in the training data, which will have a big impact on a supervised model,
it artificially affects the inter-annotator agreement evaluation, which is important for estimating the quality of the data set and for spotting problems.

caifand commented 5 years ago

Hi @kermitt2 @jameshowison I want to give an update on my progress with checking mismatches. So far I've gone through the 432 mismatches between software_name and context strings. I manually categorized each one, and in general there are several classes of errors, listed below. I've also marked the approximate count for each class so you can get some sense of the magnitude of each issue. These are estimates, as I cannot guarantee I made no errors when manually categorizing the 432 cases.

So generally these mismatches are instances of:

Input error in software_name, full_quote, on_pdf_page: e.g. typos, etc. -> already fixed the labels ~49/432
Full_quote does not match tei_full_quote:
  a. Usually this is because annotators copy text directly from the formatted PDF and paste it into the .ttl file, and the formatting of the original string gets mangled. There can be additional or missing spaces in full_quote. ~163/432
  b. "Line break problem": in PDFs, a hyphen links the two parts of a word split by a line break, but in the TEI XML these hyphens do not exist. ~47/432
  c. Special symbols are encoded inconsistently in the TEI XML, e.g. diaeresis, grave accent, superscript, curly quotes, trademarks. For example, a registered trademark symbol "®" can be represented in different ways in the TEI XML. ~52/432
Software_name belongs to the dropped parts (e.g., abstract, Notes), so no mention matches the TEI XML ~3/432 Also see #512
Missing parts in TEI XML ~50/432, with several subcategories at different levels of string absence:
  a. The TEI XML is missing a single character or symbol, or a longer string, within a sentence.
  b. The TEI XML does not follow the text order of the original article (sometimes heterogeneous content gets into the middle of a sentence, splitting the software_name or the full_quote).
  c. The TEI XML does not contain the complete full text of the original article, i.e., paragraphs or sections are missing.
  d. Total conversion failure (not human-readable). This case is very rare but exists.
Existing coding, i.e. the .ttl file, does not match the article: missing PDF? ~7/432 Could this be due to PDF link rot?
Unknown reason ~43/432 Purely by eye I cannot tell the difference between software_name, full_quote, and tei_full_quote. They look just the same.
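For the "unknown reason" cases, strings that look identical on screen often differ in characters that render the same, such as a no-break space versus an ordinary space. A small diagnostic along these lines might help; it is only a sketch using the standard library, and `describe_differences` is a hypothetical name.

```python
import unicodedata
from difflib import SequenceMatcher


def char_names(text: str) -> list[str]:
    """Unicode names of the characters, falling back to the code point."""
    return [unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in text]


def describe_differences(a: str, b: str) -> list[str]:
    """List every position where a and b differ, naming the characters involved."""
    report = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            report.append(f"{tag} at {i1}: {char_names(a[i1:i2])} -> {char_names(b[j1:j2])}")
    return report


# The two strings render alike, but the second uses a no-break space.
for line in describe_differences("SPSS 20", "SPSS\u00a020"):
    print(line)
```

An empty list means the two strings really are identical, in which case the mismatch has to come from somewhere else (for example, the wrong quote being compared).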

Following James's advice, I added tei_full_quote to all 432 cases in the .ttl files, regardless of whether it is the full_quote or the tei_full_quote that looks wrong. For the cases where the tei_full_quote looks wrong, I haven't modified the software_name label in the .ttl files yet, as I am not sure which one to follow. Are there any automated ways to fix them? And could we have systematic ways to avoid the issues listed above?
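On the question of automation, the spacing, line-break hyphen and special-symbol classes might be handled by comparing normalized keys instead of the raw strings. The sketch below is one possible approach, not an existing tool in this repository: `comparison_key` and `same_after_normalization` are hypothetical names, and the normalization choices (dropping trademark-style symbols, all hyphens and all whitespace, applying Unicode NFKC) are assumptions. The key is meant for matching only; it is not corrected text to store in the .ttl files.

```python
import re
import unicodedata

# Registered, trademark and copyright signs are removed before NFKC,
# because NFKC would otherwise turn the trademark sign into the letters "TM".
SYMBOLS = re.compile(r"[\u00ae\u2122\u00a9]")
# ASCII hyphen, Unicode hyphens, soft hyphen, and any whitespace.
HYPHENS_AND_SPACES = re.compile(r"[-\u2010\u2011\u00ad\s]+")
# Curly quotes mapped to their straight equivalents.
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def comparison_key(text: str) -> str:
    """Reduce a string to a key that ignores spacing, hyphenation and symbol variants."""
    text = SYMBOLS.sub("", text)
    text = unicodedata.normalize("NFKC", text)  # composes accents, unifies superscripts
    text = text.translate(QUOTES)
    text = HYPHENS_AND_SPACES.sub("", text)
    return text.casefold()


def same_after_normalization(a: str, b: str) -> bool:
    return comparison_key(a) == comparison_key(b)


print(same_after_normalization("Consumer Health Infor-\nmatics", "Consumer Health Informatics"))
print(same_after_normalization("SPSS\u00ae Statistics", "SPSS Statistics"))
```

Dropping every hyphen also merges legitimately hyphenated names, so a pair that matches only after normalization is a candidate for review, not a confirmed fix. Cases where text is missing from the TEI XML cannot be recovered this way.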

I can definitely give clearer explanations or examples if these descriptions are too terse.

I am gonna check the mismatches of other fields in software mentions in the next few days.

howisonlab / softcite-dataset

Issues for aligning mention and context strings #507