jpwahle / cs-insights-crawler

This repository implements the interaction with DBLP, information extraction and pre-processing of papers, and a client to store data to the cs-insights-backend.
https://aclanthology.org/2022.lrec-1.283.pdf
Apache License 2.0

Step 1: Create a general pipeline to retrieve the complete dataset #2

Closed jpwahle closed 3 years ago

jpwahle commented 3 years ago

The NLP Scholar dataset only seems to contain the titles and no other textual content of the papers (e.g., abstracts). We want to complete the dataset by adding the abstract for each paper.

trannel commented 3 years ago

For completeness I'd like to add that Jan and I talked when I collected the disc and decided how we would save the abstracts and incorporate them into the NLP Scholar dataset. We will add the abstracts to the already existing dataset file, appending one to each paper's entry. Should this take up too much space and make the file unusable, we will have to figure out something else.
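The append-a-column approach could look like this sketch, assuming the dataset is a tab-separated file with an `id` header (both the delimiter and the field names are assumptions, not necessarily the dataset's actual format):

```python
import csv
import io

def add_abstracts(dataset_tsv: str, abstracts: dict) -> str:
    """Append an 'abstract' column to a tab-separated dataset.

    The field names ('id', 'abstract') are assumptions; the real
    NLP Scholar file may use different headers.
    """
    reader = csv.DictReader(io.StringIO(dataset_tsv), delimiter="\t")
    fieldnames = list(reader.fieldnames) + ["abstract"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="\t")
    writer.writeheader()
    for row in reader:
        # papers without an extracted abstract keep an empty field
        row["abstract"] = abstracts.get(row["id"], "")
        writer.writerow(row)
    return out.getvalue()
```
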

trannel commented 3 years ago

I also put the NLP Scholar data into the repo. I'm not sure if this becomes an issue (because of redistribution etc.) once this repo goes public. Should I remove the data and instead reference it via a path in the .env?
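A minimal sketch of the .env approach, using only the stdlib (the `NLP_SCHOLAR_PATH` variable name is hypothetical, and in practice a library such as python-dotenv would do the parsing):

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: read KEY=VALUE lines into os.environ.

    A sketch only; comments and blank lines are skipped, existing
    environment variables are not overwritten.
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# With NLP_SCHOLAR_PATH=/data/nlp-scholar.txt in .env, the crawler
# could then locate the data outside the repo via
# os.environ["NLP_SCHOLAR_PATH"].
```
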

jpwahle commented 3 years ago

I think we should not include the NLP Scholar data here because the file will get large once we add the abstracts. However, there are also advantages to storing files in Git (e.g., with Git LFS) to version the dataset; then we could push each version of the dataset together with the code that created it. If versioning the dataset is something you consider important, we can do it. Let's take a look at what fits our situation best. If we later want to release the full dataset including PDFs and full texts, Zenodo is a good choice.

trannel commented 3 years ago

I agree, then, that we should not version the dataset, at least for now. Later on we can think about storing it with Git LFS or something similar.

trannel commented 3 years ago

I started writing the code to extract the abstracts and ran into the issue that the PDFs have some quality problems, which was not really unexpected. The older the papers, the harder it is to extract the abstracts, for various reasons; papers from some venues also tend not to follow a strict template. While I am still downloading the papers (I did not implement multiprocessing, and my internet is slow anyway), I already have all papers from 2010. I did a test run for those: of the 2623 papers, 831 are from one of the top-tier conferences (ACL, EMNLP, NAACL, COLING, and EACL), and for only 3 of those could some simple rules not pinpoint the abstract, because of encoding issues.

I would suggest I focus on extracting the abstracts from those top-tier conferences for the years 2010-2020 for my analysis. I will see how far back we can go before the template changes, and which other venues have a stricter template in that period, and try to include those as well. Writing an extractor covering all papers from all venues for 1965-2020 is a project in itself, I think. What do you think?
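The simple-rules idea could be sketched like this: take the text between an "Abstract" heading and the start of the introduction. This illustrates the approach only and is not the repository's actual ruleset:

```python
import re
from typing import Optional

def extract_abstract(page_text: str) -> Optional[str]:
    """Pull the abstract from a paper's first-page text with a simple
    rule: everything between an 'Abstract' heading and a line that
    starts the introduction section (or the end of the text).
    """
    match = re.search(
        r"\bAbstract\b\s*(.+?)\s*(?:\n\s*1\.?\s+Introduction\b|\Z)",
        page_text,
        flags=re.DOTALL | re.IGNORECASE,
    )
    return match.group(1).strip() if match else None
```

Papers that deviate from the standard template (older proceedings, other venues) would simply return `None` here and be counted as failures.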

jpwahle commented 3 years ago

I totally agree. We should focus on a smaller time frame to get as many abstracts as possible. So let's use 10 years (2010-2020).

truas commented 3 years ago

Great idea, guys! 10 years is a reasonable time frame, and we can focus on the actual extraction. Not that the previous years are unimportant, but if we do a good job within this window, it's a really good contribution.

trannel commented 3 years ago

So I tested some stuff with AA, and here are the results: the Python API AA provides is not Windows-compatible (though I got it to work), and setting it up for reproduction purposes is not easy. So I simply took the XML and went through it myself with lxml. Of our 52k papers we were able to get 17,271 abstracts. Slight encoding issues are included, though the HTML tags (italics, etc.) should be gone. I haven't looked into that too much yet.
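Going through the XML could look roughly like this. The sketch uses the stdlib's ElementTree instead of lxml to stay dependency-free; the element names follow the Anthology's public data/xml layout, and the ID is joined in the L12-1-38 form used later in this thread (canonical Anthology IDs are zero-padded instead):

```python
import xml.etree.ElementTree as ET

def abstracts_from_xml(xml_text: str) -> dict:
    """Map paper IDs to plain-text abstracts from one Anthology
    collection file (e.g. data/xml/L12.xml).
    """
    root = ET.fromstring(xml_text)
    out = {}
    for volume in root.iter("volume"):
        for paper in volume.iter("paper"):
            abstract = paper.find("abstract")
            if abstract is not None:
                pid = f"{root.get('id')}-{volume.get('id')}-{paper.get('id')}"
                # itertext() flattens inline markup such as <i>...</i>
                out[pid] = "".join(abstract.itertext()).strip()
    return out
```
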

AA has also grown to 63k or more papers; they added EAMT and some other venues, which are missing in NLP Scholar. It appears the newly added venues also have abstracts. IIRC there were 25k abstracts in AA in total.

For the cleanup of the AA abstracts we also have to consider citation markers, as in Lexical Functional Grammars (LFG’s) [1]., and other odd punctuation like the concept ""enhanced publication"" and its scientific value, 3. the ""fragment fitter tool"", a language processing tool, which comes from the rounded quotation marks in the abstracts. Regarding the encoding, some weird characters remain in the dataset .txt, but this is what happens when I paste them here: identify a word’s meaning. So even AA is not without encoding issues.
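A cleanup pass for these recurring artifacts might look like the following sketch; the exact replacement rules are assumptions derived from the pasted examples, not a tested normalizer:

```python
import re

def clean_abstract(text: str) -> str:
    """Normalize artifacts seen in AA abstracts: mojibake apostrophes,
    doubled straight quotes left over from rounded quotation marks,
    and bracketed numeric citation markers.
    """
    text = text.replace("’", "'")           # UTF-8 right quote mis-decoded as cp1252
    text = re.sub(r'"{2,}', '"', text)       # doubled quotes from curly quotation marks
    text = re.sub(r"\s*\[\d+\]", "", text)   # numeric citation markers like [1]
    return text
```
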

I have not tested grobid yet, but AA abstract extraction works.

jpwahle commented 3 years ago

Great progress, Lennart! I am quite surprised that AA has encoding issues in their abstracts, because authors usually have to fill in this field in UTF-8. I am also surprised that "only" 17k of 52k abstracts are available there (is that correct, or did I misunderstand?).

For the additional venues, let's not worry about it too much right now. We can always add them later.

Let's test grobid as we talked about in the last meeting to look at some examples manually that deviate the most. Maybe you can already send some of those "weird" cases you encountered to us so we can have a look (also the current ones you mentioned from AA).

trannel commented 3 years ago

Yes, we only have 17k abstracts for the 52k papers in NLP Scholar. I matched them using the ID, and the only ones that could not be found were the new papers, so I believe the matching works fine.

For the weird cases, here are some examples. These also appear in the browser for me, so I am not sure if it is really a Windows issue or whether AA has errors:

1. Search for "" in here, for example: https://github.com/acl-org/acl-anthology/blob/master/data/xml/L12.xml
2. Search for identify a word in here: https://github.com/acl-org/acl-anthology/blob/master/data/xml/L08.xml. In PyCharm the apostrophe shows as a PU2 character. The same file also has STS characters; search for We call this data set as.
3. In the NLP Scholar .txt there are also some NBSP characters, e.g. for the paper P19-1108 (not in the abstract though).
4. On the other hand, Avyayībhāva, Tatpuruṣa, Bahuvrīhi and Dvandva appears in one abstract and works fine (W16-3701, https://github.com/acl-org/acl-anthology/blob/master/data/xml/W16.xml). This might be a nice abstract to test other encodings on later, as Adṣṭādhyāyī in that abstract looks weirdly formatted in PyCharm but does not have a �. It contains \n though.

These were the ones I was able to find thus far, but there might be more.
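A small scanner along these lines could flag the suspicious characters while letting legitimate non-ASCII through. The chosen character classes are an assumption based on the examples above; PU2 and STS are the C1 control codes U+0092 and U+0093:

```python
def encoding_issues(text: str) -> list:
    """Flag characters that signalled broken extraction in this
    thread: U+FFFD replacement characters, no-break spaces, and C1
    control codes. Legitimate non-ASCII such as 'Avyayībhāva'
    passes untouched.
    """
    issues = []
    for i, ch in enumerate(text):
        if ch == "\ufffd":
            issues.append((i, "replacement character"))
        elif ch == "\u00a0":
            issues.append((i, "no-break space"))
        elif 0x80 <= ord(ch) <= 0x9F:
            issues.append((i, "C1 control"))
    return issues
```
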

trannel commented 3 years ago

An update regarding grobid: running it on Windows was a chore (again), but it works. This post lists some paper IDs, why they were/are an issue, and what grobid did. If you want to check the papers yourself, I recommend searching for the ID in the NLP Scholar .txt and then clicking/copying the link. The results are from both the lightweight and the heavy grobid model; the heavy one is supposed to be a bit more precise, but in these cases I could not find any improvements.

| ID | Issue | grobid |
|---|---|---|
| W03-2202 | Content part in tika is `None` | Persists: empty content |
| 2020.acl-main.639, C10-1014 | Not able to extract anything except some � | Persists |
| C10-2103 | The F in F-measure is missing | Persists; also "precisionoriented" is now in the abstract. The hyphen was a line break in the paper, which grobid joins. |
| N10-1052 | "fi" in "efficiently" is � | Persists: "ef ciently"; also "short-and" and "longdistance", again hyphens at line breaks removed |
| L12-1-38 | Quotation marks become "" in AA abstract | Fixed (also in tika) |
| L08-1-276 | Apostrophe ' not correctly encoded in AA abstract | Fixed (also in tika) |
| W16-3701 | Many non-ASCII letters; tika: Avyayı̄bhāva, Tatpurus.a, Bahuvrı̄hi and Dvandva | Persists: Avyayībhāva, Tatpurus . a, Bahuvrīhi and Dvandva |
| C12-1015, C12-1104, C12-1132 | Abstracts in multiple languages, keywords at the end | Fixed: finds the correct abstract at the correct length and does not include the keywords |
| W89-0222, C65-1001 | OCR used | Persists: still far from perfect; different issues than tika |
| S01-1104 | No easy boundaries for the ruleset | Persists: no abstract found |
| C10-2174 | Abstract spans 2 columns, with text at the bottom | Fixed (also in tika) |
| C08-1006 | Abstract spans 2 columns, license part of first column; in rule-based extraction the license becomes part of the abstract | Persists: cut off after the first column, before the license |

Conclusion: for older papers we cannot really use either option without good OCR beforehand. Regarding the encoding, grobid performs as well as tika and has the same issues (maybe this is my PC?). The current ruleset is also pretty much as good as grobid, with similar issues. What is worse is that grobid removes the hyphens at line breaks, which is then out of our control.

It might be more robust against other unusual templates we did not specifically account for, but I do not see any real advantage in using grobid at this point. I'm not sure whether it is worth looking into CERMINE, the other recommendation by Norman.
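Keeping the de-hyphenation under our control on tika's raw output could look like this sketch; the vocabulary check is a stand-in for whatever word list we would actually use:

```python
import re

def rejoin_linebreak_hyphens(text: str, vocabulary=None) -> str:
    r"""Handle hyphenation at line breaks ourselves instead of letting
    the extractor decide. If the joined word without the hyphen is in
    the (optional) vocabulary, drop the hyphen ('effi-\ncient' ->
    'efficient'); otherwise keep it as a real compound
    ('precision-\noriented' -> 'precision-oriented').
    """
    vocab = vocabulary or set()

    def join(match):
        left, right = match.group(1), match.group(2)
        merged = left + right
        return merged if merged.lower() in vocab else left + "-" + right

    return re.sub(r"(\w+)-\n(\w+)", join, text)
```

This keeps the "precisionoriented" problem out of the pipeline: without a vocabulary hit, the hyphen survives.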

jpwahle commented 3 years ago

I would say let's use tika then. We can always remove those articles later and use only a clean subset if we see there is too much noise (which I think won't even be necessary).

Nice progress Lennart, keep it rockin' 🚀