Auto-naming of samples may have collision issues.

LeeBergstrand commented 1 year ago

          I can think of cases where keeping the text before the first underscore could lead to naming collisions between different samples (e.g., if someone had samples like `P_putida`, `P_aeruginosa`, etc.). Might it be better to strip off the final R1 / R2 and then keep everything before that as the sample ID? Not sure...

_Originally posted by @jmtsuji in https://github.com/jmtsuji/rotary/pull/46#discussion_r1363354035_

LeeBergstrand commented 1 year ago

So, the typical use case I'm targeting is the raw Illumina files in Casava format from the basecaller.

L2S357_15_L001_R2_001.fastq.gz

I didn't want to include the lane information, so I used the left of the first _ as the identifier.

@jmtsuji I think people naming their FASTQ files with something like P_putida is a corner case, no? Wouldn't people usually use a sample ID or strain ID at the beginning of the FASTQ file?

jmtsuji commented 1 year ago

@LeeBergstrand It depends on where you get your sequencing data from. For example, I think the JGI might change the raw FastQ file names to the sample IDs supplied by the user before providing the files for the user to download (I don't 100% remember now). I also imagine that some users like to change their FastQ names to sample IDs like this for readability. But you are right, I think what I am describing is an edge case.

One easy solution to this issue could be:

By default, rotary init just takes the text before the first underscore and uses that as the sample ID, as you've currently coded it
We add a flag to rotary init (e.g., --keep_full_sample_names) to use full sample names (i.e., everything before the R1/R2)
Then, if a user tries to run rotary init and gets a naming collision error, they could turn on the flag and run init again

Thoughts?

LeeBergstrand commented 1 year ago

@jmtsuji This is a possible idea. However, at the moment, this logic is deeply embedded in the object system for representing sequencing files. See:

https://github.com/jmtsuji/rotary/blob/fa0a8effc5dedd57c217bc239d57ca51f15f49d3/rotary/sample.py#L14

It might take a bit of rework to make things work the way you suggested.

The code currently does the following:

Extract the identifier from each of the files.
Make sure the folder contains only three files for each sample.
Make sure that all three sample files have the same identifier.
Make sure each sample has a file without an R*, one with an R1, and one with an R2.

Other things the code gets around:

What if the identifier has an R1/R2 in it? (i.e. TR1V2_R1.fastq)
Successfully extracts the identifier from the nanopore FASTQ where no R1/R2 is present.

So, I did a lot of defensive programming here to ensure the correct files were assigned proper samples. Note that my code currently requires the Illumina data for each sample, but we should modify it down the road to ensure that the Illumina files are an option.

LeeBergstrand commented 1 year ago

I think there is an opportunity here down the road to make some adjustments.

jmtsuji commented 1 year ago

I see -- sounds tricky to modify the logic at the moment. It might work well to revisit having an alternative way to parse the file names (as a low priority feature request) at the same time as changing the code to allow for optional short read data, at some point in the future. Sounds good to set this as a long-term TODO.

jmtsuji commented 1 year ago

Another thought about sample naming in general (note: this is also not a high priority issue and is more of a feature request):

One problem users might run into is that the short read data and long read data will probably have different raw filenames when they come off the sequencer. I like your idea to look for the nanopore files with the same sample ID (e.g., L2S357) as the short read data. At the same time, this will probably require the user to manually rename (or symlink) the nanopore files (to make their sample IDs match the short read files) before they can run rotary init.

To give more options to the user, when we modify the sample.py code in the future, we might consider letting rotary init be able to just find short read data (R1, R2) and no long read data, then build the samples.tsv. That way, the user could point rotary init at their short read folder to make a template samples.tsv file. They could also point rotary init at their long read data folder and make a second template samples.tsv file. Then, the user could manually match these up (e.g., via Python, Excel, etc.) and make the final samples.tsv that they want to use for their run.

In other words, if we are already planning to make short read data optional for rotary init, then it might work well to make a second case where long read data is optional, too. Might help for some users down the road. What do you think?

LeeBergstrand commented 1 year ago

One problem users might run into is that the short-read data and long-read data will probably have different raw filenames when they come off the sequencer. I like your idea to look for the nanopore files with the same sample ID (e.g., L2S357) as the short read data. At the same time, this will probably require the user to manually rename (or symlink) the nanopore files (to make their sample IDs match the short read files) before they can run rotary init.

It's best practice for end-users to make sure all the files for a sample have the same identifier. If not this way, it would be tough for a machine to determine which files belong to which samples. If you're using CLI to run rotary, you can probably use rename or mv to rename your files. Any GUI version of this software could handle the file naming in code. I could see lifting this limitation for the rotary run_one, as you specified these files when running the CLI.

To give more options to the user, when we modify the sample.py code in the future, we might consider letting rotary init be able to just find short read data (R1, R2) and no long read data, then build the samples.tsv. That way, the user could point rotary init at their short read folder to make a template samples.tsv file. They could also point rotary init at their long read data folder and make a second template samples.tsv file. Then, the user could manually match these up (e.g., via Python, Excel, etc.) and make the final samples.tsv that they want to use for their run.

I would put this functionality in a dedicated CLI script or subcommand rather than the main rotary init script. Depending on the user's skillset, this might be more work than just renaming the files in CLI.

Another alternative is to have a version of ``rotary init``` that uses the directory tree to get the sample name. For example, if each of the files is held in its own directory named after the sample_id.

@jmtsuji

jmtsuji commented 1 year ago

Thanks for the discussion. That sounds good to remove the requirement for run_one but keep it for init. The directory tree idea is interesting, but I imagine it might be kind of a challenge to implement using the current code base... so it is probably a low priority to implement this. Sounds good?

LeeBergstrand commented 1 year ago

Thanks for the discussion. That sounds good to remove the requirement for run_one but keep it for init. The directory tree idea is interesting, but I imagine it might be kind of a challenge to implement using the current code base... so it is probably a low priority to implement this. Sounds good?

Sounds good!

LeeBergstrand commented 1 year ago

@jmtsuji Addressed in https://github.com/jmtsuji/rotary/pull/69

rotary-genomics / rotary

Auto-naming of samples may have collision issues. #53