neuroquery / pubget

Collecting papers from PubMed Central and extracting text, metadata and stereotactic coordinates.
https://neuroquery.github.io/pubget/
MIT License
20 stars, 12 forks

Preserve text related to public datasets #48

Closed: agtic closed this issue 1 month ago

agtic commented 2 months ago

This issue stems from recent conversations between @jeromedockes @koudyk @jbpoline @adelavega and myself.

The goal is to make it possible/easier to label text that specifies public datasets: both those that were downloaded and used in the paper’s analysis and those that were collected/created in the work described in the paper and deposited in a public repository.

In the examples we’ve looked at, these bits of text tend to be inside tags that pubget currently strips away for readability. I've included two examples below, both from the article.xml file for PMC9622880, which is also attached.

Example 1: on line 137, the dataset is referenced inside an <ext-link ext-link-type="uri"> tag:

These neural and behavioral data have been made publicly available as a large-scale database of autobiographical
 memory (<ext-link ext-link-type="uri" xlink:href="https://osf.io/exb7m/">https://osf.io/exb7m/</ext-link>)

Example 2: on line 339, the dataset is referenced inside a <notes notes-type="data-availability"> tag:

    <notes notes-type="data-availability">
      <title>Data availability</title>
      <p>Video features, memory features, and fMRI data generated in this study have been deposited in a 
repository on the Open Science Framework under access link <ext-link ext-link-type="uri" xlink:href="https://osf.io/exb7m/">https://osf.io/exb7m/</ext-link>. Raw memory videos and memory 
geocoordinate information are protected and are not available due to data privacy laws. The graph data 
generated in this study are provided in the Source Data file. <xref rid="Sec25" ref-type="sec">Source
 data</xref> are provided with this paper.</p>
    </notes>

Hopefully, this can be accomplished with some minor modifications to the text_extraction.xsl stylesheet.
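
To make the target concrete, here is a minimal Python sketch of pulling out the two kinds of elements shown above. It is not the stylesheet change itself and does not use pubget's code: it relies only on the standard-library `xml.etree.ElementTree`, and the `SAMPLE` document, the `<article>`/`<body>`/`<back>` wrapper, and the helper names `find_uri_links` and `find_data_availability` are assumptions made for illustration. The sample text is abbreviated from the two examples.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xlink:href attribute used on <ext-link>.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Trimmed-down stand-in for article.xml (an assumption for this sketch);
# the real file has far more structure and longer text.
SAMPLE = """\
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <p>These data have been made publicly available
    (<ext-link ext-link-type="uri" xlink:href="https://osf.io/exb7m/">https://osf.io/exb7m/</ext-link>)</p>
  </body>
  <back>
    <notes notes-type="data-availability">
      <title>Data availability</title>
      <p>Data generated in this study have been deposited under access link
      <ext-link ext-link-type="uri" xlink:href="https://osf.io/exb7m/">https://osf.io/exb7m/</ext-link>.</p>
    </notes>
  </back>
</article>
"""


def find_uri_links(root):
    """Return the target of every <ext-link ext-link-type="uri"> element."""
    links = []
    for elem in root.iter("ext-link"):
        if elem.get("ext-link-type") == "uri":
            # Fall back to the element text when the href attribute is absent.
            links.append(elem.get(XLINK_HREF) or (elem.text or "").strip())
    return links


def find_data_availability(root):
    """Return the whitespace-normalized text of each data-availability note."""
    notes = []
    for elem in root.iter("notes"):
        if elem.get("notes-type") == "data-availability":
            # itertext() walks nested elements, so link text is kept inline.
            notes.append(" ".join("".join(elem.itertext()).split()))
    return notes


root = ET.fromstring(SAMPLE)
uri_links = find_uri_links(root)
availability_notes = find_data_availability(root)
print(uri_links)
print(availability_notes)
```

An equivalent selection in the stylesheet would presumably match the same element and attribute names, but how that fits into text_extraction.xsl is not shown here.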

I'll try to add a few more example articles to this issue when I can.

agtic commented 1 month ago

@Precious-Macaulay & @Digital365Staking I accidentally opened this issue using the wrong GitHub account. Please resubmit your Gitpay proposal to issue #51.

Precious-Macaulay commented 1 month ago

> @Precious-Macaulay & @Digital365Staking I accidentally opened this issue using the wrong GitHub account. Please resubmit your Gitpay proposal to issue #51.

Got it!