Open shabnamkadir opened 8 years ago
This is a feature not a bug, but you're right that it probably needs better documenting somewhere...
What was the reasoning behind implementing it this way?! I can't see the benefits. I believe it has just caused havoc on the servers and on Legion, e.g. if two jobs try to write to .klustakwik2/spike_clusters.txt at once.
I now have to run everything again!
I'm not entirely sure I'm afraid...
I mean, if you have several .dat files in a single directory and then you try to cluster all of them at once - you end up in an utterly disastrous situation. Every job will try to access and write to .klustakwik2/spike_clusters.txt
and they will all get confused!
The way around it for now is for the user to put every .dat file in its own directory.
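For anyone who already has several .dat files sitting in one directory, the workaround could be scripted roughly as follows. This is only a sketch using the Python standard library: the function name `split_into_directories` is made up for illustration and is not part of klusta, and it moves only the .dat files themselves. Any parameter or probe files that refer to them by path would still need to be moved or updated by hand.

```python
from pathlib import Path
import shutil


def split_into_directories(data_dir):
    """Move each .dat file in data_dir into a subdirectory named after it.

    Hypothetical helper, not part of klusta: foo.dat ends up in foo/foo.dat.
    """
    data_dir = Path(data_dir)
    moved = []
    # Build the full list first so the moves do not disturb the iteration.
    for dat_file in sorted(data_dir.glob("*.dat")):
        if not dat_file.is_file():
            continue  # skip directories that happen to end in .dat
        target_dir = data_dir / dat_file.stem
        target_dir.mkdir(exist_ok=True)
        target = target_dir / dat_file.name
        shutil.move(str(dat_file), str(target))
        moved.append(target)
    return moved
```

Each job then runs in its own directory, so each one gets its own .klustakwik2 folder.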
I think we made this decision when thinking about the new file format (i.e. moving away from HDF5).
We considered two options:
1. Each experiment is identified by a file root, e.g. experiment1.dat, and it then generates files such as experiment1.spike_times.npy, experiment1.spike_clusters.npy, etc. This is the same philosophy as the Csicsvari format.
2. Each experiment lives in its own directory, and all files have the same name.
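To make the difference between the two options concrete, here is a small sketch of the file paths each one would produce. The function names are hypothetical, and the bare file names used for option (2) (spike_times.npy, spike_clusters.npy) are an assumption derived from the option (1) examples above, not a confirmed spec.

```python
from pathlib import Path

# Array names taken from the examples in option (1).
ARRAY_NAMES = ("spike_times", "spike_clusters")


def paths_option_1(dat_path):
    """Option (1): files sit next to the .dat file, prefixed by its file root."""
    dat_path = Path(dat_path)
    return [dat_path.with_name(f"{dat_path.stem}.{name}.npy") for name in ARRAY_NAMES]


def paths_option_2(experiment_dir):
    """Option (2): one directory per experiment, same file names everywhere."""
    return [Path(experiment_dir) / f"{name}.npy" for name in ARRAY_NAMES]
```

Under option (1) two experiments can share a directory without their files colliding. Under option (2) the directory is the only thing that tells them apart.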
I remember a lot of discussion back and forth, and although I originally favored (1) out of familiarity, we settled on (2). I remember there being some quite convincing arguments, but forget what they were now!
We could try to dig them up in old emails.
ATB k
From: Shabnam Kadir [mailto:notifications@github.com] Sent: 18 May 2016 15:03 To: kwikteam/klusta klusta@noreply.github.com Subject: Re: [kwikteam/klusta] .klustakwik2 folder problem when more than one kwik file in same folder (#23)
Thanks. I'd be curious to see the reasoning. I can't think of a single argument in favour of (2) right now! Every output file is still labelled by the experiment name, so this is in effect implementing (1). It's a shame that the .klustakwik files are inconsistent in this way. I think we have a bug, because at the moment we have a halfway house between (1) and (2).
I think there should be a big warning somewhere prominent in the docs. The unsuspecting user may have all their .dat files in one directory and not think to keep them separately. Also, I doubt most users will even check folders beginning with a '.', as they are often hidden.
@shabnamkadir you are right that this is not ideal. We started from (1) and we moved half-way to (2). The new format used by the template matching algorithms is pure (2). Since we'll be moving toward (2) eventually, I suggest we get our users used to having one experiment per folder. This could be made clearer in the docs -- PR welcome...
21:22:41 [I] launch:214 Spike detection done!
21:22:42 [I] launch:239 Starting clustering on shank 0/1.
21:22:42 [I] launch:242 Clustering group 0 (354282 spikes).
21:22:45 [W] launch:112 Unable to resume KK from /scratch/scratch/smgxsk1/kilonips/20150601/.klustakwik2/spike_clusters.txt, there are 804147 values instead of 354282
21:22:45 [I] launch:122 Starting KK...

There is only one .klustakwik2 folder produced. This is not produced inside a subfolder that is specific to the running job. If klusta is attempted on two files in a single directory - the last clustering is therefore lost.