Closed by Tooa 8 years ago
Yes, I also thought of this paper. I think this is the original paper that introduced the concept of truecasing. Why not try this established approach?
Do they have a maven dependency?
No, they don't offer a Maven dependency. Several of our other libraries, such as JavaML and the GermanNE tagger, don't have one either. In the case of NoD, I solved the problem by storing the builds in my Dropbox; Travis then fetches the jars with a bash script. This solution is kind of hacky, though.
I thought of using Git LFS [1] for this kind of dependency. But this will not work on the free plan, because bandwidth is limited to 1 GB per month and Travis resolves these dependencies on every build.
We could probably buy 50 GB of storage (with 50 GB of bandwidth) for $5 per month. But even this would not be enough, as the NE tagger is about 200 MB with bundled models. We could fork such repositories and use the solution mentioned in [2]. Either way, we have to investigate and find an appropriate solution. I will open another ticket to find a general solution to this problem.
[1] https://help.github.com/articles/billing-plans-for-git-large-file-storage/
[2] http://stackoverflow.com/questions/8871056/can-i-use-a-github-project-directly-in-maven
A better solution would probably be a custom place for artifacts. It could be just a GitHub repo, or Bintray or Artifactory.
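To make the "custom place for artifacts" idea concrete, here is a minimal sketch of the POM section that tells `mvn deploy` where to upload builds. The repository ids and the `artifacts.example.org` URLs are placeholders I made up for illustration, not an existing server of ours.

```xml
<!-- Sketch only: ids and URLs are hypothetical placeholders. -->
<distributionManagement>
  <!-- Target for release versions -->
  <repository>
    <id>custom-releases</id>
    <url>https://artifacts.example.org/releases</url>
  </repository>
  <!-- Target for -SNAPSHOT versions -->
  <snapshotRepository>
    <id>custom-snapshots</id>
    <url>https://artifacts.example.org/snapshots</url>
  </snapshotRepository>
</distributionManagement>
```

Credentials would go into a `<server>` entry in `settings.xml` whose `<id>` matches the repository id. For third-party jars that ship without a POM, the `deploy:deploy-file` goal of the Maven Deploy Plugin should be able to upload them under coordinates we choose ourselves.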
5gb per month seems totally reasonable.
Check this solution: https://github.com/tudarmstadt-lt/chinese-whispers/blob/master/pom.xml
It can also work:
<repositories>
  <repository>
    <id>johannessimon-mvn-repo</id>
    <url>https://raw.github.com/johannessimon/mvn-repo/master</url>
    <snapshots>
      <enabled>true</enabled>
      <updatePolicy>always</updatePolicy>
    </snapshots>
  </repository>
</repositories>
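With a repository entry like this in place, the consuming project would declare the artifact as an ordinary dependency. A minimal sketch follows; the coordinates are hypothetical, since the library in question does not publish any, and we would pick them ourselves when uploading the jar.

```xml
<!-- Sketch only: groupId, artifactId, and version are hypothetical. -->
<dependencies>
  <dependency>
    <groupId>org.example</groupId>
    <artifactId>truecaser</artifactId>
    <version>0.1.0-SNAPSHOT</version>
  </dependency>
</dependencies>
```

Because the repository enables snapshots with `updatePolicy` set to `always`, Maven checks it for a newer snapshot on every build.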
Yes, we have to investigate. Let us move this discussion to another issue, because it is not related to this ticket.
Update: A drawback of hosting a snapshot repository on GitHub is the 100 MB per-file size limit; the NER tagger is about 200 MB.
agree!
I suggest using the algorithm described in [1]. They report 98% accuracy on news articles. An implementation can be found here [2]. The code is licensed under the Apache License 2.0.
[1] http://www.cs.cmu.edu/~llita/papers/lita.truecasing-acl2003.pdf
[2] https://github.com/stefano-bragaglia/TrueCase