kermitt2 / entity-fishing

A machine learning tool for fishing entities
http://nerd.readthedocs.io/
Apache License 2.0

Need help to checkout, corrupt data? #120

Closed Slashdacoda closed 3 years ago

Slashdacoda commented 3 years ago

image

image

kermitt2 commented 3 years ago

Hello @Slashdacoda !

I've just tried and have seen no problem:

lopez@work:~/tmp$ git clone https://github.com/kermitt2/entity-fishing
Cloning into 'entity-fishing'...
remote: Enumerating objects: 511, done.
remote: Counting objects: 100% (511/511), done.
remote: Compressing objects: 100% (291/291), done.
remote: Total 14526 (delta 222), reused 375 (delta 148), pack-reused 14015
Receiving objects: 100% (14526/14526), 611.58 MiB | 1.98 MiB/s, done.
Resolving deltas: 100% (6480/6480), done.
Updating files: 100% (2798/2798), done.
lopez@work:~/tmp$ 

What's your OS and version of git?

You can always try with the zip, you might be luckier, e.g.:

wget https://github.com/kermitt2/entity-fishing/archive/refs/heads/master.zip

Slashdacoda commented 3 years ago

Hey @kermitt2

Windows 10 Pro: image

A bit of troubleshooting:

After updating Git to 2.31.1 (https://gitforwindows.org), still the same in Git Bash: image

On Windows Terminal: image

After installing Cygwin 2.905 (64 bit): image

I think this is a Windows/filesystem-related problem: https://brendanforster.com/notes/fixing-invalid-git-paths-on-windows/

It looks like a character problem. In my case the message is: image

The fix should be related to some character in the path, maybe:

https://github.com/kermitt2/entity-fishing/blob/master/data/corpus/corpus-long/wikipedia/RawText/Alfred_Conkling_Coxe%2C_Sr.

Why is there a %2C (that is, a ",") in a filename?
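For reference, %2C is only how the comma is written in the URL; the name stored in the repository contains a literal ",". A minimal sketch with Python's standard library, using the file name from the link above (the variable names are illustrative only):

```python
from urllib.parse import unquote

# File name exactly as it appears in the GitHub URL quoted above.
url_name = "Alfred_Conkling_Coxe%2C_Sr."

# unquote() reverses percent-encoding: %2C is the encoded form of ",".
disk_name = unquote(url_name)

print(disk_name)                # Alfred_Conkling_Coxe,_Sr.
print(disk_name.endswith("."))  # True
```

A comma is allowed in Windows file names, so the decoded name is more notable for ending in a dot.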

Slashdacoda commented 3 years ago

Update: on another PC with Windows 10 it works, that's weird ^^

Nevertheless, I will try these steps to fix my environment: https://brendanforster.com/notes/fixing-invalid-git-paths-on-windows/

Slashdacoda commented 3 years ago

OK, after all, the problem seems to be the trailing dot at the end of the file names.
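A rough sketch of how such names could be spotted before checkout. It assumes the usual Windows naming rules (no trailing dot or space, no reserved characters); the helper name `windows_name_problems` is hypothetical and not part of entity-fishing or git:

```python
from typing import List

# Characters that Windows does not allow anywhere in a file name.
WINDOWS_RESERVED_CHARS = set('<>:"/\\|?*')


def windows_name_problems(name: str) -> List[str]:
    """Return the reasons why `name` is not a portable Windows file name."""
    problems = []
    if name.endswith("."):
        problems.append("trailing dot")
    if name.endswith(" "):
        problems.append("trailing space")
    bad = sorted(c for c in set(name) if c in WINDOWS_RESERVED_CHARS)
    if bad:
        problems.append("reserved characters: " + "".join(bad))
    return problems


print(windows_name_problems("Alfred_Conkling_Coxe,_Sr."))  # ['trailing dot']
print(windows_name_problems("wikipedia.xml"))              # []
```

The check is deliberately small: it ignores reserved device names such as CON or NUL, and path length limits.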

My environment can't find the 2 files with this naming scheme. I figured it out when recommitting the changed file name:

image

A possible solution is renaming them without a dot at the end of the name. Following this proposal, I ask myself:

  1. Is it enough to rename them, or do we have to change other things in other sections of the project?
  2. Why does only my environment have problems with this naming scheme? "blalb.c." > identified as a C file

kermitt2 commented 3 years ago

Hello !

The data you are pointing to come from an external evaluation corpus "Wikipedia" created by:

Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In Zoubin Ghahramani, editor, Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, volume 227 of ACM International Conference Proceeding Series, pages 129–136. ACM. DOI <https://doi.org/10.1145/1273496.1273513>.

and they use the Wikipedia article name as file name - bad practice for file portability, but it's not our choice.

The file names are referenced in data/corpus/corpus-long/wikipedia/wikipedia.xml, @docName, that's it.

I guess there is no problem renaming these files (this corpus is not very useful beyond old system comparison, and is not updated), just be sure to rename them also in the corresponding wikipedia.xml for consistency... PR welcome ! :)
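For illustration only, the rename plus the wikipedia.xml update could be sketched as below. This is not an existing script in the repository: the function name `strip_trailing_dots` is hypothetical, and the only thing assumed about the XML is that the file name sits in a `docName` attribute, whatever the element is called.

```python
import os
import xml.etree.ElementTree as ET


def strip_trailing_dots(corpus_dir, xml_path):
    """Rename files ending in '.' and update matching docName attributes.

    Returns a dict mapping each old file name to its new name.
    """
    renamed = {}
    for name in sorted(os.listdir(corpus_dir)):
        source = os.path.join(corpus_dir, name)
        new_name = name.rstrip(".")
        if not os.path.isfile(source) or not new_name or new_name == name:
            continue
        target = os.path.join(corpus_dir, new_name)
        if os.path.exists(target):
            # Refuse to overwrite a file that already has the new name.
            raise FileExistsError(target)
        os.rename(source, target)
        renamed[name] = new_name

    # Assumption: any element may carry the docName attribute.
    tree = ET.parse(xml_path)
    for element in tree.iter():
        doc_name = element.get("docName")
        if doc_name in renamed:
            element.set("docName", renamed[doc_name])
    tree.write(xml_path, encoding="utf-8", xml_declaration=True)
    return renamed
```

It would be called with the RawText directory and data/corpus/corpus-long/wikipedia/wikipedia.xml, on a system that can check the files out in the first place (for example Linux). ElementTree rewrites the whole file and drops XML comments by default, so the resulting diff is worth reviewing before committing.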

Slashdacoda commented 3 years ago

The checkout problem should be fixed, thx for the support @kermitt2