knmnyn / ParsCit

An open-source CRF Reference String Parsing Package
http://wing.comp.nus.edu.sg/parsCit
GNU Lesser General Public License v3.0

Line-structure for training / size of models #2

Closed · adibaba closed this issue 13 years ago

adibaba commented 13 years ago

Hello ParsCit-team,

while I was trying to train ParsCit, two questions arose:

(I) Which line structure do you suggest for tagged data? In the tagged data of citeseerx [1], I found lines like: `<title> ... component analysis, </title> In: <booktitle> Proc. of ... </booktitle>`. Do strings between tags work, and if so, does this hold for all characters or only for strings separated by spaces? It would be great to be able to specify fields like `<title> My thesis </title>.` instead of `<title> My thesis. </title>` to get metadata without special characters.

(II) Why is the size of the current default model so small? The file 'parsCit.allData.100401.model' in the resources directory [2] is only 5 MB. This is smaller than 'parsCit.coraOnly.model' (9.4 MB) or 'parsCit.allData.090625b.model' (23 MB).

Best regards, Adrian

[1] http://aye.comp.nus.edu.sg/parsCit/citeseerx.tagged.txt [2] https://github.com/knmnyn/ParsCit/tree/master/resources

knmnyn commented 13 years ago

Hi Adibaba:

1a) The tagged data should consist of all tokens delimited and assigned to an XML tag in your tag inventory.
1b) You can post process the metadata to remove special characters. Currently ParsCit does not do much post processing itself.
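
To illustrate 1a, a fully tagged line in the style of citeseerx.tagged.txt might look like this (the tokens and tag assignments here are illustrative, not copied from the actual training file):

```
<author> A. K. Jain , </author> <title> Principal component analysis , </title> <booktitle> In: Proc. of ICPR </booktitle> <date> 2004 . </date>
```

Note that every token, including the "In:" and the punctuation, is delimited by spaces and falls inside some tag from the inventory.

For 1b, a minimal post-processing sketch of the kind you could bolt on yourself (this is outside ParsCit; the function name and the set of stripped characters are just one possible choice):

```python
import re

def clean_field(value):
    """Strip surrounding whitespace and trailing separator punctuation
    from an extracted metadata field, e.g. turning 'My thesis.' into
    'My thesis'."""
    value = value.strip()
    # drop any run of trailing spaces, dots, commas, semicolons, colons
    return re.sub(r'[\s.,;:]+$', '', value)

print(clean_field(" My thesis. "))  # -> My thesis
```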

2) A smaller model turned out to work better for the larger data set. We use different training parameters on the different sets of data to achieve the best accuracy for the data set in question; in the case of the larger data set, it is perhaps the model with fewer features (and hence smaller size) that turns out to be best. While not necessarily optimal, we have done some simple experiments to validate that the model works well. If you find a training setup that works better than the current model and are willing to share how you did it, we'd be grateful to incorporate your suggestions.
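
For context, ParsCit trains its models with CRF++, and the model size depends heavily on the training flags: raising the feature frequency cut-off discards rare features and shrinks the model file. A sketch of such an invocation (the file names and flag values here are illustrative, not the settings used for the released models):

```
# crf_learn ships with CRF++; -f N keeps only features seen >= N times,
# -c controls the trade-off between fitting the data and regularization.
crf_learn -f 3 -c 1.0 template_file train.data parsCit.model
```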

adibaba commented 13 years ago

Hello Min,

thanks again for your quick reply.

1a) Okay. Good to know :)

1b) Yes, we have already thought about this and plan to implement post-processing of the data.

2) At the moment, we use an initial tagged data set drawn from Cora, CiteSeerX, FLUX-CiM, and humanities (English). While parsing the papers of a conference, we look for errors and use the matching raw reference strings to generate additional tagged data. A good starting set should be a mix of different citation systems and heterogeneous data. Is the file of tagged data you use available online?

We are developing components to analyze and visualize publication data in a project group at the University of Paderborn. If we produce usable results, we'll share them with you.

Best regards, Adrian

knmnyn commented 13 years ago

2) I'm not sure what you mean by available online. The file (citeseer.tagged.txt) is part of the distribution, and the distribution is LGPL, so it is available online. The other three chunked training files are likewise part of the distribution. I think I'm missing the core part of your question.

Yes, agreed: the set should be a representative mix of the citation systems used in your production data (or that you expect to find there).

We'd love to hear about your project, especially if you have papers on the subject. I think many groups (including ours) are trying to use ParsCit and other similar software to do projects similar to what you suggest.

-Min

adibaba commented 13 years ago

Ah, I was not sure which file you use for the current model. Okay, it's citeseer.tagged.txt.