jacksongoode / NIME-proceedings-analyzer

A tool written in Python to perform a bibliographic analysis of the NIME proceedings archive and other similar corpora.
GNU General Public License v3.0

Author info missing when analyzing papers published in PubPub #6

Open stefanofasciani opened 1 year ago

stefanofasciani commented 1 year ago

The analyzer fails to retrieve author information for papers published in PubPub, and therefore all location-based analyses fail.

The current PubPub publishing interface and the NIME author guidelines do not guarantee consistent and complete author information in the NIME proceedings. Indeed, among the 2021 papers there are papers with author names only, papers with author names and affiliations, and papers that provide all the "traditional" information (name, affiliation, email).

Author information is partially hidden in PubPub (you have to click on the "show details" button at the top right). However, the analyzer downloads the XML directly from PubPub, and that XML includes only author names; the additional information visible under "show details" is missing.

A possible workaround is to download the PDF generated by PubPub and proceed as for pre-2021 papers (using Grobid to process the PDF files). However, the PDFs from PubPub are malformed (although this can be fixed in the analyzer script).
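To illustrate what the Grobid route would recover, here is a minimal sketch that reads author name, email, institution and country from a TEI header of the kind Grobid produces. The function name `extract_authors` and the sample document `SAMPLE_TEI` are hypothetical and made up for illustration; this is not the analyzer's actual code, and real Grobid output may omit any of these fields.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by Grobid's XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample (an assumption, not real Grobid output for a NIME paper).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc><biblStruct><analytic>
    <author>
      <persName><forename type="first">Ada</forename><surname>Example</surname></persName>
      <email>ada@example.org</email>
      <affiliation>
        <orgName type="institution">Example University</orgName>
        <address><country key="NO">Norway</country></address>
      </affiliation>
    </author>
  </analytic></biblStruct></sourceDesc></fileDesc></teiHeader>
</TEI>"""


def extract_authors(tei_xml):
    """Return one dict per author found in the TEI header; missing fields are None."""
    root = ET.fromstring(tei_xml)
    authors = []
    for author in root.iterfind(".//tei:analytic/tei:author", TEI_NS):
        forenames = [f.text for f in author.iterfind("tei:persName/tei:forename", TEI_NS) if f.text]
        surname = author.findtext("tei:persName/tei:surname", default="", namespaces=TEI_NS)
        authors.append({
            "name": " ".join(forenames + [surname]).strip(),
            "email": author.findtext("tei:email", default=None, namespaces=TEI_NS),
            "institution": author.findtext(
                "tei:affiliation/tei:orgName[@type='institution']",
                default=None, namespaces=TEI_NS),
            "country": author.findtext(
                "tei:affiliation/tei:address/tei:country",
                default=None, namespaces=TEI_NS),
        })
    return authors


authors = extract_authors(SAMPLE_TEI)
```

The country field is the one the location-based analysis depends on, so a paper whose PDF lacks affiliation details would still come back with `None` there.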

At the time of opening this issue, the NIME paper BibTeX file still does not include the 2022 papers. At some point, the organizers of the 2022 conference downloaded the LaTeX files from PubPub and used them to build paper PDF files in the traditional column format.

Before making any modification to the proceedings analyzer, it is important to understand what the current and future publishing format for NIME papers will be. If PDF papers come back at some point (perhaps also for 2021), we can simply scrap the current handling of PubPub (or perhaps handle 2021 manually as an exception).

stefanofasciani commented 1 year ago

Also worth considering the fact that NIME 2023 will not use PubPub.

jacksongoode commented 1 year ago

Wow, this would be a very significant change... And one that would hinder projects like this one. It seems a lot of the feedback has been about the editing process and the PDF rendering, which I feel are complaints about style and traditional processes? But the decision has been made, I assume? This shouldn't have any major impact on this project now, but I was hoping a digital/structured solution would enable immediate, non-machine-learning parsing in the future.

stefanofasciani commented 1 year ago

The 2022 proceedings have been added to the NIME BibTeX file. Although the proceedings are still stored in PubPub, the 2022 BibTeX entries are different (the URL field contains the DOI and no longer the string "pubpub"). This can be easily fixed by changing line 277 of pa_extract.py to "if 'pubpub' in pub['url'] or 'doi.org' in pub['url']:" (each substring needs its own "in" test; "'pubpub' or 'doi.org' in pub['url']" is always true in Python). However, we will still suffer from the same problem (i.e. we cannot fetch author information).
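A minimal sketch of that URL check, assuming `pub` is a dict with a `url` key as in the snippet quoted from the analyzer. The helper name `is_pubpub_entry` is hypothetical, and treating every `doi.org` URL as a PubPub paper is the assumption made in this thread, not something the BibTeX file guarantees. Each substring is tested explicitly, because a bare `'pubpub' or 'doi.org' in url` is always truthy in Python: the non-empty string `'pubpub'` short-circuits the `or`.

```python
def is_pubpub_entry(pub):
    """Return True if the BibTeX entry's URL looks like a PubPub-hosted paper.

    Assumption: URLs containing "pubpub" (2021 style) or "doi.org"
    (2022 style) both point to PubPub.
    """
    url = pub.get("url", "")
    return "pubpub" in url or "doi.org" in url


# Placeholder URLs for illustration only.
doi_style = is_pubpub_entry({"url": "https://doi.org/10.0000/example"})
pdf_style = is_pubpub_entry({"url": "https://www.nime.org/proceedings/2019/example.pdf"})

# The pitfall: this expression is truthy for any URL at all.
always_truthy = bool("pubpub" or "doi.org" in "https://www.nime.org/")
```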

Since PubPub may no longer be used in the future, we can opt to consider 2021 and 2022 as "exceptions", and manually download and store the PDFs in the repository.

stefanofasciani commented 1 year ago

Furthermore, the following code to download XML from PubPub (in pa_load.py) no longer works:

                # For PubPub entries, look up the JATS XML link in the page source, then fetch it.
                if pub['puppub'] and '.xml' not in url:
                    url = re.search(r'jats","url":"(.*?\.xml)', r.text).group(1)
                    r = session.get(url)
                with open(dl_path + fn, 'wb') as f:
                    f.write(r.content)

In particular, it seems that PubPub blocks the download attempt, recognizing that there is not a human with a browser on the other side. Indeed, the downloaded XML does not include any paper-related information, only the following (plus some other content I did not check):

    <div id="challenge-body-text" class="core-msg spacer">
        assets.pubpub.org needs to review the security of your connection before proceeding.
    </div>
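One way the analyzer could avoid saving such a page as if it were paper XML is to check the response body before writing it to disk. The sketch below is only an assumption about how that might look: the helper name `looks_like_challenge_page` is hypothetical, and the markers are taken from the snippet above, so they may change whenever the hosting service changes its challenge page.

```python
# Byte strings copied from the blocked response quoted above (not a stable API).
CHALLENGE_MARKERS = (
    b"challenge-body-text",
    b"needs to review the security of your connection",
)


def looks_like_challenge_page(content):
    """Return True if the downloaded bytes contain a known challenge-page marker."""
    return any(marker in content for marker in CHALLENGE_MARKERS)


blocked = (
    b'<div id="challenge-body-text" class="core-msg spacer">'
    b"assets.pubpub.org needs to review the security of your connection "
    b"before proceeding.</div>"
)
# Made-up stand-in for a real JATS document.
paper = b"<article><front><article-meta/></front></article>"

blocked_detected = looks_like_challenge_page(blocked)
paper_detected = looks_like_challenge_page(paper)
```

This only detects the block; it does not work around it.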

However, for 2022, traditional two-column PDF papers have been generated by the organizers (pubpub --> latex --> PDF) and are somewhat hidden here: https://www.nime.org/proceedings/2022/115.pdf (the last part of the path is the "pdf" field in the BibTeX entry).
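A small sketch of how those PDF links could be rebuilt from the BibTeX entries. It assumes each entry is a dict with `year` and `pdf` fields, and that the directory in the path matches the entry's year, which is inferred from the single example URL above. The helper name `nime_pdf_url` is hypothetical.

```python
def nime_pdf_url(pub, base="https://www.nime.org/proceedings"):
    """Build the PDF URL from the entry's year and its "pdf" field.

    Assumption: the layout <base>/<year>/<pdf> holds for every 2022 entry.
    """
    return f"{base}/{pub['year']}/{pub['pdf']}"


url = nime_pdf_url({"year": "2022", "pdf": "115.pdf"})
```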