jedgusse / project_lorenzo


Data in github #2

Closed by emanjavacas 7 years ago

emanjavacas commented 7 years ago

I was wondering whether we really need to have the data git-tracked; the main drawback is the very costly commits that each preprocessing run over the source files will incur. Perhaps we should use the server for storing the data and the repo for code (both preprocessing and experiment code).

What do you think @jedgusse, @mikekestemont ?

mikekestemont commented 7 years ago

Yes, I would keep the preprocessing scripts in the repo, but I wouldn't git-track the data.

emanjavacas commented 7 years ago

Alright, then perhaps we should back up the data on the server and clean the repo history. We should then add the preprocessing scripts, each taking the path to the source file as an argument, and perhaps also add another script to download the original files, so that we can recreate the corpus at any time. I can take care of the first part, but I will wait until we know what @jedgusse thinks about this :-)
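Something along these lines is what I have in mind for the script interface (just a sketch to fix ideas; the script name, the --output flag and the preprocess_file helper are placeholders, not code that exists in the repo):

```python
# Sketch of the intended command-line interface (all names are placeholders).
import argparse


def preprocess_file(path):
    """Read one source file and return its cleaned text (placeholder logic)."""
    with open(path, encoding='utf-8') as f:
        text = f.read()
    # Whatever normalization we agree on goes here; whitespace collapsing as a stand-in.
    return ' '.join(text.split())


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Preprocess a single source file')
    parser.add_argument('source', help='path to the source file')
    parser.add_argument('--output', help='file to write the result to (default: stdout)')
    args = parser.parse_args()

    result = preprocess_file(args.source)
    if args.output:
        with open(args.output, 'w', encoding='utf-8') as f:
            f.write(result)
    else:
        print(result)
```

The download script could follow the same pattern, taking a target directory instead of a source file.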

mikekestemont commented 7 years ago

excellent idea.

jedgusse commented 7 years ago

Hi both, I've pushed the preprocessing code to the repo. It will not work directly on the downloaded corpora, since I have made some automatic adjustments to the way the data is organized, for instance within the Perseus XML (which is troublesome indeed, but the ploughing.py script I have added to the repository let me plough my way through it). I could send you the Perseus XML as I have reorganized it? Or should I put it on GitHub too?
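To give you an idea, the core of that step is just walking the TEI body of each Perseus file and pulling out the running text, roughly like this (a simplified sketch, not the actual ploughing.py; namespace handling and element names vary between files):

```python
# Simplified sketch of the Perseus extraction step (not the actual ploughing.py;
# namespace handling and element names differ between files).
import sys
import xml.etree.ElementTree as ET

TEI_NS = '{http://www.tei-c.org/ns/1.0}'


def extract_text(xml_path):
    """Collect the running text from the <body> of a Perseus/TEI XML file."""
    root = ET.parse(xml_path).getroot()
    body = root.find(f'.//{TEI_NS}body')
    if body is None:  # some files come without the TEI P5 namespace
        body = root.find('.//body')
    if body is None:
        raise ValueError(f'no <body> element found in {xml_path}')
    chunks = (piece.strip() for piece in body.itertext())
    return ' '.join(piece for piece in chunks if piece)


if __name__ == '__main__':
    print(extract_text(sys.argv[1]))
```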

emanjavacas commented 7 years ago

I think it's fine if we store your "ploughed" data on the server or somewhere where we can download it ourselves and treat it as original data.

emanjavacas commented 7 years ago

Hi, I've removed the history and backed up the data on the server at /home/manjavacas/data/lorenzo_data.tar.gz (including the new Patrologia data: patrologia_rnr, for "Patrologia rock & roll").

The repo is now free of data commits (in fact, of all previous commits).

emanjavacas commented 7 years ago

Hi guys, I just realized that all the old commit history is back. I am not sure why, but you will probably have to re-clone the repository (otherwise the old history will come back whenever you push again, since it didn't get removed from your local copies). I am going to clean it up again and try removing the repo and cloning it again so that you can continue working from the clean version :-)

jedgusse commented 7 years ago

This is probably my fault! We should discuss this on Friday. :-)

mikekestemont commented 7 years ago

Aha, thanks for the notice; let us get to the bottom of this on Friday. Mike
