inukshuk / anystyle

Fast citation reference parsing
https://anystyle.io

Training not taking effect #70

Closed dominic-sps closed 7 years ago

dominic-sps commented 7 years ago

My Training file

<author>Haile WB, Gavegnano C, Tao S, Jiang Y, Schinazi RF, Tyor WR:</author> <title>The Janus kinase inhibitor ruxolitinib reduces HIV replication in human macrophages and ameliorates HIV encephalitis in a murine model.</title> <journal>Neurobiology of disease</journal> <date>2016,</date> <volume>92(Pt B):</volume><pages>137-143.</pages>
<author>DerSimonian R, Laird N.</author> <title>Meta-analysis in clinical trials revisited.</title> <journal>Contemp Clin Trials</journal> <date>2015;</date><volume>45(Pt A):</volume><pages>139-145.</pages>
<author>de la Tremblaye PB, Linares NN, Schock S, Plamondon H</author> <date>(2016)</date> <title>Activation of CRHR1 receptors regulates social and depressive-like behaviors and expression of BDNF and TrkB in mesocorticolimbic regions following global cerebral ischemia.</title> <journal>Exp Neurol</journal> <volume>284 (Pt A):</volume><pages>84-97.</pages>

I tried appending the above training data to the existing train.txt file, and I also tried training on it separately. In both cases, the following does not produce a proper match.

require 'anystyle/parser'
#refStr= "Crompton M. The mitochondrial permeability transition pore and its role in cell death. Biochem J. 1999; 341: (Pt 2)233-249"
refStr= "Haile WB, Gavegnano C, Tao S, Jiang Y, Schinazi RF, Tyor WR: The Janus kinase inhibitor ruxolitinib reduces HIV replication in human macrophages and ameliorates HIV encephalitis in a murine model. Neurobiology of disease 2016, 92(Pt B):137-143."
puts Anystyle.parse(refStr, :hash)

It sometimes matches when I insert spaces around the tokens. I am not sure why it does not match when I use the same line for both training and testing. I would appreciate any suggestions.

dominic-sps commented 7 years ago

Thank you for your comment. I tried again with spaces around the training tags.

<author> Haile WB, Gavegnano C, Tao S, Jiang Y, Schinazi RF, Tyor WR: </author> <title> The Janus kinase inhibitor ruxolitinib reduces HIV replication in human macrophages and ameliorates HIV encephalitis in a murine model. </title> <journal> Neurobiology of disease </journal> <date> 2016, </date> <volume> 92(Pt B): </volume> <pages> 137-143. </pages>
<author> DerSimonian R, Laird N. </author> <title> Meta-analysis in clinical trials revisited. </title> <journal> Contemp Clin Trials </journal> <date> 2015; </date> <volume> 45(Pt A): </volume> <pages> 139-145. </pages>
<author> de la Tremblaye PB, Linares NN, Schock S, Plamondon H </author> <date> (2016) </date> <title> Activation of CRHR1 receptors regulates social and depressive-like behaviors and expression of BDNF and TrkB in mesocorticolimbic regions following global cerebral ischemia. </title> <journal> Exp Neurol </journal> <volume> 284 (Pt A): </volume> <pages> 84-97. </pages>

Somehow I am getting the same (hash) results.

{:author=>"Haile WB, Gavegnano C, Tao S, Jiang Y, Schinazi RF, Tyor WR:", :title=>"The Janus kinase inhibitor ruxolitinib reduces HIV replication in human macrophages and ameliorates HIV encephalitis in a murine model.", :journal=>"Neurobiology of disease", :date=>"2016,", :volume=>"92(Pt", :pages=>"B):137-143."}

inukshuk commented 7 years ago

The problem with volume and pages is that the training data does not reflect what the parser sees after tokenization: by default, spaces are used as token boundaries; your training data includes:

<volume> 92(Pt B): </volume> <pages> 137-143. </pages>

But the tokens generated from your input will be: 92(Pt and B):137-143.. We already have some special case tokenization rules for volume / page numbers (because they are often not separated by spaces) but I guess they don't catch this case. That's something we should be able to fix.
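
To make the mismatch concrete, here is a minimal sketch that uses plain Ruby `String#split` in place of the parser's tokenizer. It assumes whitespace-only token boundaries, as described above, and ignores the special-case volume / page rules; the variable names are illustrative only.

```ruby
# Illustration only: String#split stands in for the parser's tokenizer.
WHITESPACE = /\s+/

# The tail of the reference as it appears in the input.
input_tokens = "92(Pt B):137-143.".split(WHITESPACE)

# The same tail as the tagged training line spells it out
# (tags removed, one space between the volume and pages segments).
training_tokens = "92(Pt B): 137-143.".split(WHITESPACE)

p input_tokens     # => ["92(Pt", "B):137-143."]
p training_tokens  # => ["92(Pt", "B):", "137-143."]
```

The training line labels three tokens while the input yields only two, so what was learned for `B):` and `137-143.` never lines up with a token seen at parse time.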

Apart from that, spaces between or around the tags in your training file should not make a difference I believe (if they do that's something we should fix separately).

inukshuk commented 7 years ago

Please note that you can configure the token separator by setting Anystyle.parser.options[:separator].

For example, I could change the token separator so that the tokens parsed from your input match the tokens you supplied as training data:

>> Anystyle.parser.options[:separator] = /\s+|\b([^\s]+:)/
>> Anystyle.parse("Haile WB, Gavegnano C, Tao S, Jiang Y, Schinazi RF, Tyor WR: The Janus kinase inhibitor ruxolitinib reduces HIV replication in human macrophages and ameliorates HIV encephalitis in a murine model. Neurobiology of disease 2016, 92(Pt B):137-143.", :hash)
=> [{:author=>"Haile WB, Gavegnano C, Tao S, Jiang Y, Schinazi RF, Tyor WR:", :title=>"The Janus kinase inhibitor ruxolitinib reduces HIV replication in human macrophages and ameliorates HIV encephalitis in a murine model.", :journal=>"Neurobiology of disease", :date=>"2016,", :volume=>"92(Pt B):", :pages=>"137-143."}]
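
The effect of that pattern can be checked outside the parser with plain `String#split`. This is only a sketch: it assumes the separator is applied roughly like a split pattern, and it drops the empty strings that the capture group can leave behind, which may differ from what the parser does internally.

```ruby
# Illustration only: apply the separator from the example with String#split.
separator = /\s+|\b([^\s]+:)/

# Because the pattern has a capture group, split keeps the captured text
# ("B):") as its own element; empty strings are filtered out here.
tokens = "92(Pt B):137-143.".split(separator).reject(&:empty?)

p tokens  # => ["92(Pt", "B):", "137-143."]
```

These three tokens line up with the `<volume> 92(Pt B): </volume> <pages> 137-143. </pages>` training segment.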

philgooch commented 7 years ago

Deleted my unhelpful comments, I wasn't aware of the token separator option, thanks @inukshuk :)

dominic-sps commented 7 years ago

Actually, I tried the bibtex format, and content is missing in many cases. Hence I am using the hash (or tags) format, as suggested in another thread.

I have attached some samples that use Pt X for the issue. I have more samples if needed.

Pt.txt

inukshuk commented 7 years ago

As I said above, the main issue you're seeing is that in your examples the issue and page number segments are not separated by a space: that means the parser will treat them as a single token by default. It is not possible to infer tokenization rules from the training data; that's why you need to adjust the default token separator pattern, as in the example I posted above. Once the token separator works for your input, I suspect you'd need only a handful of references for training, because the Pt X) tokens should be easy to recognize.

We can consider adjusting the default token separator for everyone, but I'd be careful because it may have adverse effects on other data sets. That's the reason why the separator is an option that's easy to change.

dominic-sps commented 7 years ago

@inukshuk: As you mentioned, the setting below works fine for my earlier sample.

Anystyle.parser.options[:separator] = /\s+|\b([^\s]+:)/

However, using it as a global parser setting had side effects on the output for a few of my other samples. The ":" can appear in many places, including the title and the DOI. In the worst case, I thought of adding a pre-processing step for my input before passing it to AP, although that might defeat the point of using machine learning. The alternative would be post-processing the AP output.

inukshuk commented 7 years ago

Yes, the pattern was just a quick example to make my point : ) I think we could probably come up with a safer pattern that still works, but causes less disruption elsewhere.
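
As one possible direction, here is a sketch of a narrower pattern that only adds a token boundary between a closing `):` and a following digit. The pattern is hypothetical and has only been tried here with plain `String#split`, not with the parser or against other data sets.

```ruby
# Hypothetical narrower separator: whitespace, or the zero-width position
# between "):" and a digit. Colons elsewhere are left alone.
narrow_separator = /\s+|(?<=\):)(?=\d)/

pt_tokens    = "92(Pt B):137-143.".split(narrow_separator)
doi_tokens   = "doi:10.1000/xyz".split(narrow_separator)   # made-up DOI
title_tokens = "Title: a subtitle".split(narrow_separator)

p pt_tokens     # => ["92(Pt", "B):", "137-143."]
p doi_tokens    # => ["doi:10.1000/xyz"]
p title_tokens  # => ["Title:", "a", "subtitle"]
```

Because the extra boundary is zero-width, no text is consumed, and colons in titles or DOIs do not trigger a split on their own.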

That said, you're right that you could either do pre- or post-processing. Pre-processing is harder in my experience, because any form of pattern replacement can distort other parts of your references. Post-processing is basically what our 'normalizers' do if you use formats such as citeproc or bibtex: they're much easier to write because you can focus on a few selected segments returned by the machine learning process. In your case, you could train the parser to detect the combined issue and page number and then use a normalizer (or post-processor) to parse out the relevant information from those segments only.
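
A post-processing step along those lines might look like the sketch below. The helper `split_volume_issue_pages` and its pattern are hypothetical, written for the "Pt X" samples in this thread; they are not part of the parser's normalizer API.

```ruby
# Hypothetical post-processor for segments such as "92(Pt B):137-143.".
PT_SEGMENT = /\A(?<volume>\d+)\s*\((?<issue>Pt\.?\s*\w+)\)\s*:\s*(?<pages>\d+\s*[-\u2013]\s*\d+)\.?\z/

# Returns a hash with :volume, :issue and :pages, or nil if the
# segment does not have the expected "volume(Pt X):pages" shape.
def split_volume_issue_pages(segment)
  m = PT_SEGMENT.match(segment.strip)
  return nil unless m

  { volume: m[:volume], issue: m[:issue], pages: m[:pages].gsub(/\s+/, "") }
end

# The mis-split hash output from earlier in the thread can be rejoined first.
parsed = { volume: "92(Pt", pages: "B):137-143." }
combined = [parsed[:volume], parsed[:pages]].join(" ")

p split_volume_issue_pages(combined)
# => {:volume=>"92", :issue=>"Pt B", :pages=>"137-143"}
```

Segments that do not match the pattern return nil, so the original values can be kept unchanged in that case.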