tudarmstadt-lt / newsleak

Science and Data-Driven Journalism: Data Extraction and Interactive Visualization of Unexplored Textual Datasets for Investigative Data-Driven Journalism
http://newsleak.io
GNU Affero General Public License v3.0

Implement truecasing for PlusD dataset #1

Closed Tooa closed 8 years ago

Tooa commented 9 years ago

I suggest using the algorithm described in [1]. The authors report 98% accuracy on news articles. An implementation can be found at [2]; its code is licensed under the Apache License 2.0.

[1] http://www.cs.cmu.edu/~llita/papers/lita.truecasing-acl2003.pdf
[2] https://github.com/stefano-bragaglia/TrueCase
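
To make the task concrete, here is a minimal sketch of a truecaser. It is not the algorithm from [1] and not the API of [2]; it is a much simpler most-frequent-casing baseline, and the function names and the two training sentences are made up for illustration.

```python
from collections import Counter, defaultdict


def train_truecaser(cased_sentences):
    """Learn the most frequent surface form of each lowercased token."""
    counts = defaultdict(Counter)
    for sentence in cased_sentences:
        tokens = sentence.split()
        # The first token is capitalized by position, so it is skipped as evidence.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    # Keep only the most frequent surface form for each lowercased token.
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}


def truecase(text, model):
    """Restore casing token by token; unknown tokens are lowercased."""
    tokens = [model.get(t.lower(), t.lower()) for t in text.split()]
    if tokens:
        # Capitalize the sentence start without changing the rest of the token.
        tokens[0] = tokens[0][:1].upper() + tokens[0][1:]
    return " ".join(tokens)


# Hypothetical training data; a real model would need a large, well-cased corpus.
training = [
    "The embassy in Berlin sent a cable to Washington",
    "Officials in Washington discussed the cable with Berlin",
]
model = train_truecaser(training)
print(truecase("THE EMBASSY IN BERLIN SENT A CABLE", model))
# prints "The embassy in Berlin sent a cable"
```

A baseline like this ignores context, so it cannot tell apart words whose casing depends on their neighbours; that is the gap a context-aware approach is meant to close.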

alexanderpanchenko commented 9 years ago

Yes, I also thought of this paper. I think it is the original paper that introduced the concept of truecasing. Why not try this established approach?

alexanderpanchenko commented 9 years ago

Do they have a maven dependency?

Tooa commented 9 years ago

No, they don't offer a Maven dependency. Several of our other libraries, such as JavaML or the GermanNE tagger, don't have one either. In the case of NoD, I solved the problem by storing the builds in my Dropbox; Travis then fetches the jars with a bash script. This solution is kind of hacky, though.

I thought of using Git LFS [1] for this kind of dependency. But that will not work on the free plan, because bandwidth is limited to 1 GB per month and Travis resolves these dependencies on every build.

We could probably buy 50 GB of storage (with 50 GB of bandwidth) for $5 per month. But even this would not be enough, as the NE-Tagger is about 200 MB with bundled models. We could fork such repositories and use the solution mentioned in [2]. Nevertheless, we have to investigate and find an appropriate solution. I will open another ticket to find a general solution to this problem.

[1] https://help.github.com/articles/billing-plans-for-git-large-file-storage/
[2] http://stackoverflow.com/questions/8871056/can-i-use-a-github-project-directly-in-maven
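
The bandwidth concern can be checked with a quick back-of-the-envelope calculation. This sketch uses only the figures quoted above and assumes that 1 GB is 1000 MB, that only the roughly 200 MB NE-Tagger is downloaded, and that every build downloads it in full with no caching.

```python
# Figures quoted in the comment above; 1 GB is taken as 1000 MB (an assumption).
MB_PER_GB = 1000
NE_TAGGER_MB = 200  # approximate size of the NE-Tagger with bundled models


def builds_per_month(bandwidth_gb, artifact_mb=NE_TAGGER_MB):
    """Number of full artifact downloads that fit into a monthly bandwidth quota."""
    return (bandwidth_gb * MB_PER_GB) // artifact_mb


free_plan = builds_per_month(1)   # free plan: 1 GB of bandwidth per month
paid_plan = builds_per_month(50)  # paid plan: 50 GB of bandwidth per month
print(free_plan, paid_plan)
# prints "5 250"
```

Under these assumptions the free plan covers about 5 builds per month and the paid plan about 250, before counting any other dependency.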

alexanderpanchenko commented 9 years ago

A better solution would probably be a dedicated place for artifacts. It could be just a GitHub repo, or Bintray or Artifactory.

alexanderpanchenko commented 9 years ago

5gb per month seems totally reasonable.

Check this solution: https://github.com/tudarmstadt-lt/chinese-whispers/blob/master/pom.xml

It could also work:

<repositories>
    <repository>
        <id>johannessimon-mvn-repo</id>
        <url>https://raw.github.com/johannessimon/mvn-repo/master</url>
        <snapshots>
            <enabled>true</enabled>
            <updatePolicy>always</updatePolicy>
        </snapshots>
    </repository>
</repositories>

https://github.com/johannessimon/mvn-repo

Tooa commented 9 years ago

Yes, we have to investigate. Let us move this discussion to another issue, because it is not related to this ticket.

Update: A drawback of using a snapshot repository on GitHub is the 100 MB size limit per file; the NER tagger is about 200 MB.
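
Before pushing jars to such a repository, it may help to check which files would exceed the limit. The following is a small sketch, not part of the project; the function name is made up, and the limit is taken as 100 * 1024 * 1024 bytes, which is an assumption about how the 100 MB figure is counted.

```python
from pathlib import Path

# Per-file limit mentioned above, interpreted here as 100 * 1024 * 1024 bytes.
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(repo_dir, limit=GITHUB_FILE_LIMIT_BYTES):
    """Return sorted (path, size_in_bytes) pairs for files larger than limit."""
    return sorted(
        (str(path), path.stat().st_size)
        for path in Path(repo_dir).rglob("*")
        if path.is_file() and path.stat().st_size > limit
    )
```

Calling `oversized_files("mvn-repo")` on a local checkout (the directory name is only an example) would list every file that is too large, such as a 200 MB tagger jar with bundled models.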

alexanderpanchenko commented 9 years ago

agree!