Closed LeeBergstrand closed 1 year ago
So, the typical use case I'm targeting is the raw Illumina files in Casava format from the basecaller.
L2S357_15_L001_R2_001.fastq.gz
I didn't want to include the lane information, so I used the left of the first _
as the identifier.
@jmtsuji I think people naming their FASTQ files with something like P_putida
is a corner case, no? Wouldn't people usually use a sample ID or strain ID at the beginning of the FASTQ file?
@LeeBergstrand It depends on where you get your sequencing data from. For example, I think the JGI might change the raw FastQ file names to the sample IDs supplied by the user before providing the files for the user to download (I don't 100% remember now). I also imagine that some users like to change their FastQ names to sample IDs like this for readability. But you are right, I think what I am describing is an edge case.
One easy solution to this issue could be:
rotary init
just takes the text before the first underscore and uses that as the sample ID, as you've currently coded itrotary init
(e.g., --keep_full_sample_names
) to use full sample names (i.e., everything before the R1/R2)rotary init
and gets a naming collision error, they could turn on the flag and run init
againThoughts?
@jmtsuji This is a possible idea. However, at the moment, this logic is deeply embedded in the object system for representing sequencing files. See:
https://github.com/jmtsuji/rotary/blob/fa0a8effc5dedd57c217bc239d57ca51f15f49d3/rotary/sample.py#L14
It might take a bit of rework to make things work the way you suggested.
The code currently does the following:
Other things the code gets around:
So, I did a lot of defensive programming here to ensure the correct files were assigned proper samples. Note that my code currently requires the Illumina data for each sample, but we should modify it down the road to ensure that the Illumina files are an option.
I think there is an opportunity here down the road to make some adjustments.
I see -- sounds tricky to modify the logic at the moment. It might work well to revisit having an alternative way to parse the file names (as a low priority feature request) at the same time as changing the code to allow for optional short read data, at some point in the future. Sounds good to set this as a long-term TODO.
Another thought about sample naming in general (note: this is also not a high priority issue and is more of a feature request):
One problem users might run into is that the short read data and long read data will probably have different raw filenames when they come off the sequencer. I like your idea to look for the nanopore files with the same sample ID (e.g., L2S357
) as the short read data. At the same time, this will probably require the user to manually rename (or symlink) the nanopore files (to make their sample IDs match the short read files) before they can run rotary init
.
To give more options to the user, when we modify the sample.py code in the future, we might consider letting rotary init
be able to just find short read data (R1, R2) and no long read data, then build the samples.tsv
. That way, the user could point rotary init
at their short read folder to make a template samples.tsv file. They could also point rotary init
at their long read data folder and make a second template samples.tsv file. Then, the user could manually match these up (e.g., via Python, Excel, etc.) and make the final samples.tsv that they want to use for their run.
In other words, if we are already planning to make short read data optional for rotary init
, then it might work well to make a second case where long read data is optional, too. Might help for some users down the road. What do you think?
One problem users might run into is that the short-read data and long-read data will probably have different raw filenames when they come off the sequencer. I like your idea to look for the nanopore files with the same sample ID (e.g.,
L2S357
) as the short read data. At the same time, this will probably require the user to manually rename (or symlink) the nanopore files (to make their sample IDs match the short read files) before they can runrotary init
.
It's best practice for end-users to make sure all the files for a sample have the same identifier. If not this way, it would be tough for a machine to determine which files belong to which samples. If you're using CLI to run rotary
, you can probably use rename
or mv
to rename your files. Any GUI version of this software could handle the file naming in code. I could see lifting this limitation for the rotary run_one
, as you specified these files when running the CLI.
To give more options to the user, when we modify the sample.py code in the future, we might consider letting
rotary init
be able to just find short read data (R1, R2) and no long read data, then build thesamples.tsv
. That way, the user could pointrotary init
at their short read folder to make a template samples.tsv file. They could also pointrotary init
at their long read data folder and make a second template samples.tsv file. Then, the user could manually match these up (e.g., via Python, Excel, etc.) and make the final samples.tsv that they want to use for their run.
I would put this functionality in a dedicated CLI script or subcommand rather than the main rotary init
script. Depending on the user's skillset, this might be more work than just renaming the files in CLI.
Another alternative is to have a version of ``rotary init``` that uses the directory tree to get the sample name. For example, if each of the files is held in its own directory named after the sample_id.
@jmtsuji
Thanks for the discussion. That sounds good to remove the requirement for run_one
but keep it for init
. The directory tree idea is interesting, but I imagine it might be kind of a challenge to implement using the current code base... so it is probably a low priority to implement this. Sounds good?
Thanks for the discussion. That sounds good to remove the requirement for
run_one
but keep it forinit
. The directory tree idea is interesting, but I imagine it might be kind of a challenge to implement using the current code base... so it is probably a low priority to implement this. Sounds good?
Sounds good!
@jmtsuji Addressed in https://github.com/jmtsuji/rotary/pull/69
_Originally posted by @jmtsuji in https://github.com/jmtsuji/rotary/pull/46#discussion_r1363354035_