swcarpentry / shell-novice

The Unix Shell
http://swcarpentry.github.io/shell-novice/

Good Names for Files and Directories #743

Open jmhuculak opened 6 years ago

jmhuculak commented 6 years ago

For instruction #3 of "Good Names for files and directories", the text says, "Stick with letters, numbers, . (period), - (dash) and _ (underscore)."

We should specify the convention that the period is used only to denote the file extension. As currently written, a new user might create a filename such as "Draft.Worksheet.Names2018.txt". It would be clearer to say, "Stick with letters, numbers, - (dash) and _ (underscore), ending with . (period) followed by the extension."
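A minimal sketch of how the proposed rule could be checked in the shell. The function name `is_conventional_name` and the exact pattern are assumptions made for illustration and are not part of the lesson. The pattern accepts letters, digits, dashes and underscores, then a single period and an alphanumeric extension; it also rejects a leading dash, which commands may mistake for an option.

```shell
# Hypothetical helper: succeeds only if the name starts with a letter,
# digit or underscore, continues with letters, digits, '-' or '_',
# and ends with exactly one '.' followed by an alphanumeric extension.
is_conventional_name() {
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9_][A-Za-z0-9_-]*\.[A-Za-z0-9]+$'
}

for name in "Draft.Worksheet.Names2018.txt" "draft-worksheet-names-2018.txt"; do
    if is_conventional_name "$name"; then
        echo "ok:      $name"
    else
        echo "flagged: $name"
    fi
done
```

Under this pattern the first name is flagged because it contains more than one period, and the second is accepted.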

clarkfitzg commented 6 years ago

+1

clarkfitzg commented 6 years ago

It may also be useful to link to a more thorough explanation of best practices, e.g. http://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming which talks about incorporating dates into filenames, etc.

jttkim commented 6 years ago

I'm not convinced by these Stanford best practices. The suggestion to have version numbers in file names is rather inconsistent with the concepts and practices covered in the git lesson. More generally, I'd consider things like researcher name, dates, geographical coordinates etc. to be metadata which should best be kept in the files themselves. Discussing metadata and perhaps the FAIR principles would be way too much of a detour / distraction in this context though, so I'd suggest sticking with the simple guideline of restricting characters to a safe subset of letters, numbers, underscores etc. Thinking ahead to "extensions" with more or fewer than 3 letters (e.g. .gz, .fastq etc.) and even dual "extensions" (.tar.gz, .fastq.gz etc.), I'd prefer to present having only one such "extension" as a convention that may be more or less consistently used, depending on the OS and software systems used.
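To illustrate why dual "extensions" complicate the convention, here is a small sketch using standard POSIX parameter expansion. The file name `reads.fastq.gz` is a hypothetical example; which part counts as "the extension" depends on which expansion a script happens to use.

```shell
name="reads.fastq.gz"   # hypothetical file name

# '%'  removes the shortest suffix matching '.*' (the last extension only).
# '%%' removes the longest suffix matching '.*' (everything from the first period).
# '##' removes the longest prefix matching '*.' (keeps only the last extension).
echo "${name%.*}"    # reads.fastq
echo "${name%%.*}"   # reads
echo "${name##*.}"   # gz
```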

clarkfitzg commented 6 years ago

I just read the portion in the lesson again and see that it doesn't mention consistency, which is the most important thing in my mind. That is, if you join an existing project that has files foo.1.txt, ..., foo.42.txt then it's better to name the next file foo.43.txt rather than foo_43.txt. The latter may well break something.
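A rough sketch of how the inconsistent name can break something: a wildcard written for the existing pattern silently skips the new file. The files are created in a throwaway directory, and the variable names are illustrative assumptions.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

touch foo.1.txt foo.2.txt foo.42.txt   # files following the existing convention
touch foo_43.txt                       # new file that departs from it

# Expand the wildcard a script written for the convention would use,
# and count how many of the four files it picks up.
set -- foo.*.txt
matched=$#
echo "foo.*.txt matches $matched of 4 files"
```

Here the wildcard matches three files, so anything that loops over `foo.*.txt` never sees `foo_43.txt`.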

Simplicity should be a priority. I took the Stanford reference as a relatively loose set of guidelines. If it potentially makes things more confusing then there's no need to link to it. Addressing a couple of @jttkim's points:

> The suggestion to have version numbers in file names is rather inconsistent with the concepts and practices covered in the git lesson.

Earlier this year I was working with several other people on versions of a 4 GB file named foo_v1.txt.gz, ..., foo_v7.txt.gz. This is too big to fit in Git. Furthermore, we passed the file around on network drives, outside of Git repos. The file name made it immediately obvious which version we were on for everyone involved. Is there any better way than a file name to do this? Honest question.

> More generally, I'd consider things like researcher name, dates, geographical coordinates etc. to be metadata which should best be kept in files themselves.

Ideally, yes. But not all file formats support this.

clarkfitzg commented 6 years ago

Another possible way to make this clear and simple for users is to list positive and negative examples along with the explanations, ie:

Safe names for files:

Problematic names for files:

jttkim commented 6 years ago

@clarkfitzg I agree that using naming patterns consistently is a good idea almost all of the time, and I think I understand the spirit of the examples you give. My main concern is that learners who are very new to the shell will not have much experience and background to make sense of such recommendations. Therefore I think the purpose of discussing file naming is to prepare learners for the rest of the shell lesson, and not so much to set out the consequences and perils of various approaches to file naming.

As a comment on the 4 GB file processed by various people, would it be possible to think of the versions of this file as different stages of processing? Generally, it would be desirable that all those who modify the file do so in an automated and reproducible way, e.g. by some script. Then, rather than having to store a series of versions, storing the primary file (i.e. the first version) and all scripts that process it into the final result is all that's necessary and sufficient to document and reproduce the analysis process. And even if that's not possible, I'd look into reflecting processing steps in the file names and / or in metadata, e.g. foo.txt, foo-completeonly.txt where all incomplete records are stripped out, foo-completeonly-withgeolocation.txt where latitude + longitude data is added based on addresses etc. But again -- I think appreciating this discussion requires experience that you obviously have, but that we can't assume learners to have at this stage.
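As a sketch of the "primary file plus scripts" idea, the following derives foo-completeonly.txt from foo.txt. The record layout (three comma-separated fields) and the sample records are assumptions invented purely for illustration; the thread does not describe the real data.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical primary file: one record per line, three comma-separated fields.
printf '%s\n' 'a,1,x' 'b,,y' 'c,3,z' > foo.txt

# Processing step: keep only records in which all three fields are non-empty.
awk -F, 'NF == 3 && $1 != "" && $2 != "" && $3 != ""' foo.txt > foo-completeonly.txt

cat foo-completeonly.txt
```

Because the derived file can be regenerated from foo.txt and this script, only the primary file and the script need to be kept.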

gcapes commented 6 years ago

Re: version control of large files: I've not used it, but I have heard good things about this: https://git-lfs.github.com/

clarkfitzg commented 6 years ago

I appreciate the thoughts on file versioning. The use case with large files I described above was dumps of a database after various changes were made in the system. In hindsight I think incorporating a timestamp into the filename might work reasonably well for this because then the creation timestamp would be preserved as the file is copied. But as @jttkim points out we don't need to bring this kind of discussion / nuance into the lesson.
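One way the timestamp idea might look in practice, as a sketch only. The `foo_` prefix follows the earlier example in this thread; the timestamp layout and the variable names are assumptions. Putting year, month and day in that order means the names also sort chronologically.

```shell
# UTC timestamp in year-month-day order, e.g. 20180314T150926Z.
stamp=$(date -u +%Y%m%dT%H%M%SZ)

# Hypothetical name for a database dump taken at that moment.
dump_name="foo_${stamp}.txt.gz"
echo "$dump_name"
```

Since the timestamp is part of the name, it travels with the file when it is copied between machines or network drives.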

In the interest of moving this forward towards resolution, here are the changes to the lesson that have been suggested in this thread:

  1. clarify periods are used for file extensions
  2. add link to Stanford (or other) best practices for file naming
  3. list positive and negative examples for file names rather than "just" explaining them

The periods for file extensions are standard practice, while we've been debating the other two. I'm for the link, because I found that particular resource helpful for sorting file names. I like the positive and negative examples because they seem more immediate to me than reading through the actual text, i.e. DO THIS: ... NOT THIS: .... We don't need to add any more content; if anything, the link and examples would let us trim some of the current content.

jttkim commented 6 years ago

I suggest adopting something like the original suggestion by @jmhuculak about extensions. Mentioning this convention briefly is nice for learners who have heard that term somewhere already and all learners will benefit from having a bit of terminology for talking about file names.

My main concern with entering into best practices at this stage of the episode is that some learners, importantly those with little or no previous experience, have no background that enables them to understand and appreciate their relevance (i.e. they have never had to work with larger sets of files that are inconsistently named and where essential metadata is scant and difficult to obtain). Hitting them with a raft of best practices will unnecessarily distract and confuse some of them. Any discussion of this sort is way above the level of e.g. the "What's In A Name" discussion later in episode 3 -- so adding the material here will disrupt the flow of the episode.

For this reason I'd also prefer to restrict examples to positive and simple ones. Examples that enable learners to choose names that are likely to give them some initial experience of commands working simply and "normally" (before inevitably encountering those brain-teasers of escaping and masking leading minuses etc.) are good, while complex examples are likely to distract inexperienced learners and negative ones may additionally concern them unnecessarily.
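For readers unfamiliar with the "leading minus" brain-teaser mentioned above, here is a brief sketch. The file name `-draft.txt` is hypothetical and is created in a throwaway directory.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create a file whose name starts with a dash. The './' prefix stops
# touch from reading the name as a set of options.
touch ./-draft.txt

# Two common ways to refer to such a file safely:
ls -- -draft.txt     # '--' marks the end of options
rm ./-draft.txt      # a name given as './...' does not start with '-'
```

A name that starts with a letter or number avoids both workarounds.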

Generally I think that some additional discussion of file naming could have some merit. I particularly like the consistency point made by @clarkfitzg, and I think the optional exercises at the end of this episode would be a good place for that. I'd like to hear from more instructors on the Stanford link, though -- having a link to best practices suggesting version components in filenames in the shell lesson and the "FINAL".doc cartoon in the git lesson is bound to lead to remarks by learners in the not too distant future. So perhaps this could be discussed on the mailing list?

clarkfitzg commented 6 years ago

The more general question here is: what's SWC's stance on external links to resources that aren't official documentation?

Before bringing it up on the mailing list maybe maintainer @gdevenyi can provide some guidance here.

colinmorris commented 6 years ago

I don't think it's unprecedented. There are lots of non-official links in the reference section and the instructor guide.

If we think something like the Stanford page is worth learners' time and attention despite 1 or 2 questionable pieces of advice, I think there's probably a way to introduce it that implies that this is just one (non-authoritative) perspective which they might find valuable, but shouldn't necessarily accept uncritically.

(Whether the first part of that sentence holds in this case isn't something I have a strong opinion on.)