jameshowison / softcite

Study of software citation in the biology literature

Gold standard handling removal of [] for references #6

Open jameshowison opened 9 years ago

jameshowison commented 9 years ago

How should I handle the situations where pdf_to_text has removed the square brackets from references?

e.g.: 2004-46-NATURE-C01-mention:

The molecule is rich in proline residues (13%) and analysis of its amino acid sequence with PONDR21 indicates that in the absence of other viral components at least the N-terminal half of the subunit would be disordered.

Should I do anything to indicate that the 21 here was actually a reference? I don't know that stemming is going to work, because some software names do end in numbers. I wonder whether some tweak to the pdf_to_text code might help us here?
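A minimal sketch of one way to flag such cases for manual review, rather than stripping trailing digits blindly. Everything here is an assumption for illustration and not part of the softcite code: the function name `flag_fused_references`, the `max_ref` parameter (the number of entries in the paper's reference list), and the small allow-list of names that genuinely end in digits.

```python
import re

# Illustrative allow-list (an assumption): names that really end in digits.
NAMES_ENDING_IN_DIGITS = {"Primer3", "GROMOS96"}

# At least two letters (hyphens allowed inside) directly followed by 1-3 digits.
FUSED = re.compile(r"\b([A-Za-z][A-Za-z\-]*[A-Za-z])(\d{1,3})\b")

def flag_fused_references(sentence, max_ref):
    """Return (name, number) pairs that may be a name fused with a reference number."""
    candidates = []
    for match in FUSED.finditer(sentence):
        name, number = match.group(1), int(match.group(2))
        if match.group(0) in NAMES_ENDING_IN_DIGITS:
            continue  # known name, the digits belong to it
        if 1 <= number <= max_ref:
            candidates.append((name, number))  # plausible reference number
    return candidates

sentence = ("analysis of its amino acid sequence with PONDR21 indicates "
            "that the subunit would be disordered")
print(flag_fused_references(sentence, max_ref=30))
```

This is only a heuristic: it will also flag tokens such as gene or protein names that end in small numbers, so the output is a list of candidates to check, not a list of confirmed references.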

yg4886 commented 9 years ago

For this case, maybe you could add the [] manually to the sentence. Unfortunately, pdf_to_text won’t help us here. Thanks

In reply to James Howison's comment of Nov 20, 2014, 11:23 AM.

jameshowison commented 9 years ago

Hmmm, but if I add those manually to the labeled ones, they won't be in the unlabelled sentences... that doesn't seem right to me. Looking at 2004-46-NATURE.pdf, the PONDR21 thing comes from PONDR superscript 21. Not sure why pdf_to_text is ignoring the superscript.

Anyway, can you think about whether correcting these but not correcting things in other sentences would be a problem?
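One way to sidestep the mismatch, sketched here under assumptions, is to apply the same automatic rewrite to both the labeled and the unlabelled sentences, so neither set is corrected by hand alone. The function `restore_brackets`, its `max_ref` and `keep` parameters, and the example name in `keep` are hypothetical and not part of the softcite code.

```python
import re

# Two or more letters directly followed by 1-3 digits, e.g. "PONDR21".
FUSED = re.compile(r"\b([A-Za-z]{2,})(\d{1,3})\b")

def restore_brackets(sentence, max_ref, keep=frozenset()):
    """Rewrite 'Name21' as 'Name [21]' when 21 could be a reference number."""
    def rewrite(match):
        token, name, number = match.group(0), match.group(1), match.group(2)
        if token in keep or not (1 <= int(number) <= max_ref):
            return token  # leave known names and implausible numbers alone
        return f"{name} [{number}]"
    return FUSED.sub(rewrite, sentence)

# The same function is run over both sets, so they stay consistent.
labeled = ["analysis of its amino acid sequence with PONDR21 indicates"]
unlabelled = ["primers were designed with Primer3"]
keep = frozenset({"Primer3"})  # illustrative allow-list (an assumption)

print([restore_brackets(s, max_ref=30, keep=keep) for s in labeled])
print([restore_brackets(s, max_ref=30, keep=keep) for s in unlabelled])
```

Because the rule is automatic, any mistakes it makes appear in training and test data alike, which is a different trade-off from hand-correcting only the gold standard.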

yg4886 commented 9 years ago

Yes, I think you are right. The test dataset is generated by pdf_to_text too, so we need to use the manually labeled sentences to predict pdf_to_text-generated sentences (the test dataset) that still contain this uncertainty. That said, I am not sure; the manual corrections might still increase the accuracy.

And yes, pdf_to_text is annoying. But I think there will always be some noise in the training dataset in practice. Sometimes it is difficult to reduce the noise; in our project, for example, the cost is high. (There may be cheap ways to do it: we could use AMT to convert all the PDFs to text.) We’ll always be training the model on data that contains noise, and then trying whatever we can to improve its performance.

And, also, Byron, do you have any comments?

-Yan

In reply to James Howison's comment of Feb 20, 2015, 5:49 PM.