mjwestgate / synthesisr

Data import and deduplication for evidence synthesis projects

read_refs() incorrectly splitting abstract over multiple fields #18

Open nealhaddaway opened 2 years ago

nealhaddaway commented 2 years ago

read_refs() is incorrectly splitting the abstract in this record across multiple fields: 10.1111/j.1469-8137.2004.01201.x

nealhaddaway commented 2 years ago

I believe it's this line x <- gsub(",(?=\\s[[:alpha:]]{2,})", " and ", x, perl = TRUE) (line 18) of 'clean_functions.R': https://github.com/mjwestgate/synthesisr/blob/master/R/clean_functions.R
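To make it easier to see what that substitution does, here is a small sketch in base R. The input strings are made up for illustration and are not taken from the affected record, and whether this line is actually the cause is not confirmed here.

```r
# The substitution quoted above: a comma followed by whitespace and at least
# two letters is replaced with " and ". Input strings are hypothetical.
x <- c("roots, soil", "Smith, J.")
out <- gsub(",(?=\\s[[:alpha:]]{2,})", " and ", x, perl = TRUE)
print(out)
```

The lookahead does not consume the whitespace, so the original space after the comma is kept after the inserted " and ". A comma followed by a single initial, as in the second string, is left alone.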

chriscpritchard commented 2 years ago

I'm looking into this, but I can't seem to replicate the issue. Would you be able to upload an RIS file where this occurs?

nealhaddaway commented 2 years ago

I've emailed you a file - don't think I can publish it here..

chriscpritchard commented 2 years ago

I've had a look and I still can't replicate this.

The abstract is all in the abstract field:

r$> x <- read_refs("C:\\Users/chris/Downloads/references-problem.ris")
r$> View(x)

gives me:

"Contents I. Introduction 2 II. Carbon in temperate grasslands 2 III. The process of carbon sequestration in soils 4 IV. Tracking carbon movement 9 V. Models of soil carbon dynamics 10 VI. Management effects on carbon sequestration 11 VII. Climate-change effects on carbon sequestration 12 VIII. Response to elevated CO2 13 IX. Conclusions 14 References 14 Summary The substantial stocks of carbon sequestered in temperate grassland ecosystems are located largely below ground in roots and soil. Organic C in the soil is located in discrete pools, but the characteristics of these pools are still uncertain. Carbon sequestration can be determined directly by measuring changes in C pools, indirectly by using 13C as a tracer, or by simulation modelling. All these methods have their limitations, but long-term estimates rely almost exclusively on modelling. Measured and modelled rates of C sequestration range from 0 to > 8 Mg C ha−1 yr−1. Management practices, climate and elevated CO2 strongly influence C sequestration rates and their influence on future C stocks in grassland soils is considered. Currently there is significant potential to increase C sequestration in temperate grassland systems by changes in management, but climate change and increasing CO2 concentrations in future will also have significant impacts. Global warming may negate any storage stimulated by changed management and elevated CO2, although there is increasing evidence that the reverse could be the case."

Might be helpful for you to demo for me or explain the exact steps to replicate the bug?
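One way to share exact steps without publishing the original file is a small self-contained script. The sketch below is only an illustration: the RIS record is invented, and it assumes `read_refs()` can be called with just a file path, as in the call shown earlier in this thread.

```r
# Write a small, shareable RIS record to a temporary file (hypothetical content).
ris_lines <- c(
  "TY  - JOUR",
  "AU  - Example, A.",
  "TI  - An example title",
  "AB  - Contents I. Introduction 2 II. Methods, results and discussion 4",
  "PY  - 2004",
  "ER  - "
)
ris_file <- tempfile(fileext = ".ris")
writeLines(ris_lines, ris_file)

# Only attempt the import if synthesisr is installed; try() keeps the script
# running so any error can be pasted into the report as well.
if (requireNamespace("synthesisr", quietly = TRUE)) {
  print(utils::packageVersion("synthesisr"))
  refs <- try(synthesisr::read_refs(ris_file))
  if (is.data.frame(refs)) print(colnames(refs))
}
```

Posting the printed package version together with the column names makes it easier to tell whether two people are running the same code.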

nealhaddaway commented 2 years ago

Hmm weird.. all I do is synthesisr::read_refs(file.choose()) and select that references.ris.

Could be I’ve not updated synthesisr for a while. I’ll try again tomorrow.

(Sent by email in reply to Chris Pritchard's comment of 16 Mar 2022, 21:37.)

nealhaddaway commented 2 years ago

Do you not get three additional columns (investigator, IV and XI I think)?


chriscpritchard commented 2 years ago

Just checked: I get those columns when using the CRAN version. It appears to be fixed in master, perhaps in c406bc9f92c56dbcfb52700a8e0c81ab36cd2c01.
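For anyone comparing the two builds, a minimal sketch of checking what is installed is shown below. The install line is left commented out and assumes the `remotes` package is available; it is a common way to install from GitHub, not something prescribed in this thread.

```r
# Check which build is installed before comparing results.
installed <- requireNamespace("synthesisr", quietly = TRUE)
if (installed) {
  print(utils::packageVersion("synthesisr"))
} else {
  message("synthesisr is not installed")
}

# Installing the development version from GitHub (assumes 'remotes' is installed):
# remotes::install_github("mjwestgate/synthesisr")
```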

nealhaddaway commented 2 years ago

Aaah cool. Grand! I’ll pull in from GitHub next time then.

Martin may want a ping to push this to CRAN then, I guess?


nealhaddaway commented 2 years ago

This is still happening with a different file when using the GitHub version. See this example EMBASE file: https://gitlab.com/extending-the-earcheck/living-review/-/blob/master/search/literature_search_02/Embase_290521_N974.RIS?expanded=true&viewer=simple

It reads in across 30 columns, which shouldn't all be there.
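A quick way to see which columns look spurious is to count how many records actually populate each one. The sketch below uses a toy data frame in place of the real import; the column names and values are hypothetical, and the "fewer than half of the records" threshold is an arbitrary choice for illustration.

```r
# Toy data frame standing in for the imported references (hypothetical values).
refs <- data.frame(
  title    = c("A", "B", "C"),
  abstract = c("x", "y", "z"),
  IV       = c(NA, "Tracking carbon movement 9", NA),
  stringsAsFactors = FALSE
)

# Number of non-missing entries per column.
filled <- colSums(!is.na(refs))
print(filled)

# Columns populated in fewer than half of the records are candidates for
# fields created by a mis-split abstract.
sparse_cols <- names(filled)[filled < nrow(refs) / 2]
print(sparse_cols)
```

Running the same two summaries on the real import would give a short list of suspect columns to include in the report.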