Feature Request: Employ Standard Unix Input Convention for `--reads`

cbird808 commented 2 months ago

Hi, great package!

I have a request, or I'd be happy to do the work and send you pull req if you'd implement these.

In the unix philosophy, every command accepts input file(s) rather than directories. Following this, when calling --paired, I suggest making --reads, --fwd, and --rev accept a pattern (glob) to match file names rather than a directory name, and you hardcoding the file path and naming convention. It's trivial to change this and would also make the behavior of --reads match not only expectation based on all unix commands, but also its behavior when --single is set. It would also allow you to divorce from locking the end user into having fastq in the file name. fq is also common.
With the previous request implemented, you can extract the FASTQ file extension from the argument given for --reads, --fwd, and --rev

With these changes, the end users are freed from having to rename their files to match the expectations of rainbow_bridge.

mhoban commented 2 months ago

This is a perfectly reasonable request. I think there was a reason I didn't do it this way in the first place, but I can't think of why that was. It may have had partially to do with the fact that inputs could be any of single or multiple forward and/or reverse reads, depending on the (de)multiplexing strateg(ies, see e.g. #82 ) used for the sequencing run.

I firmly agree that users shouldn't have to rename their files to match internal expectations of the pipeline.

I'd welcome the pull request, but any implementation will need to allow the multiple input scenarios.

cbird808 commented 2 months ago

Cool. I'm speaking from experience here. My intro to bash was forking ddocent 10 years ago. After I gained experience, I realized one original sin of that software is having an inflexible file naming convention and not following unix conventions. It causes issues to this day.

Here's my thoughts on how to make the changes:

You probably set the default --reads based on either --single or --paired being called. The default for -paired -reads changes from "." To "./.fq*"
- The code will need to be searched for usage of $reads and updated as necessary to comply with a file path glob rather than a dir
For -fwd and -rev (note my computer autocorrects -- to -), the defaults are "./${r1}.fq." and "./${r2}.fq" . if they are user-specified then -reads is ignored.
- The code will need to be searched for usage of $fwd and $rev and updated as necessary to comply with a file path glob rather than a dir
The present code used to piece together the file paths from -reads, --fwd, and -rev will be deprecated/ commented out
- Any dependencies on variables created here will need to be rectified
The fasta extension will be stored in a variable and extracted using FASTQ=$(basename "$reads" | sed -E 's/.*\.([^.]*(\.gz)?)$/\1/'), or equivalent for $fwd $rev
- Every instance of fastq will need to be replaced with the variable containing the file extension

Any thoughts on this?

From: mykle hoban @.> Sent: Thursday, August 15, 2024 4:24 PM To: mhoban/rainbow_bridge @.> Cc: Bird, Chris @.>; Author @.> Subject: Re: [mhoban/rainbow_bridge] Feature Request: Employ Standard Unix Input Convention for --reads (Issue #81)

This is a perfectly reasonable request. I think there was a reason I didn't do it this way in the first place, but I can't think of why that was. It may have had partially to do with the fact that inputs could be any of single or multiple forward and/or reverse reads, depending on the (de)multiplexing strateg(ies, see e.g. #82https://github.com/mhoban/rainbow_bridge/issues/82 ) used for the sequencing run.

I firmly agree that users shouldn't have to rename their files to match internal expectations of the pipeline.

I'd welcome the pull request, but any implementation will need to allow the multiple input scenarios.

- Reply to this email directly, view it on GitHubhttps://github.com/mhoban/rainbow_bridge/issues/81#issuecomment-2292284442, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ADBV4SYM7RZ7YDRLXZULL2TZRUL7NAVCNFSM6AAAAABMS4QVX2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOJSGI4DINBUGI. You are receiving this because you authored the thread.Message ID: @.**@.>>

cbird808 commented 2 months ago

Hi @mhoban ,

I've added glob handling functionality for --reads, --fwd, and --rev while maintaining the original functionality described in the manual. Use the attached files in place of the orig (brief README.md included). I've tested with the test_paired_demuxed.sh. By my reading of the code, the other tests shouldn't be affected.

As is, I think it will handle any combination of globs from --fwd and --rev. It will also handle globs from -reads that follow the input rules for a glob passed to Channel.fromFilePairs.

If I were to make one primary change to what I did, it would be to take the verbose NextFlow code used create a glob that's accepted by Channel.fromFilePairs from params.fwd and params.rev and rewrite it in pure groovy so it can be a helper. Because i was not familiar with nextflow, I learned the hard way that making the present nextflow code a process is a no-go because a channel, which is output by a process, cannot be converted back to a string, which is the only format accepted by Channel.fromFilePairs.

That being said, the core premise of the code I added, identifying the characters that are(nt) in common between strings, could be repurposed and modified to handle any --reads glob, as well as --fwd and --rev dirs without --reads.

Let me know how it works for you. ceb_update_1.zip

mhoban commented 2 months ago

Thanks for this! I'm in the process of working on some other updates but I will take a look at this and merge in your modifications once those are implemented. I appreciate that you have retained the existing functionality, since things were largely designed as they were for ease of use by less unix-savvy users (viz. grad students in the tobo lab)

cbird808 commented 2 months ago

Sounds good. Let me know if there are any unintended consequences that I missed.

To allow for any file extension, not just fastq, I think the most efficient solution is to extract it when you extract the sample ID and add it to the reads channel. Then update the processes that you pass reads to and replace fastq with $extension

Get Outlook for Androidhttps://aka.ms/AAb9ysg

From: mykle hoban @.> Sent: Wednesday, August 21, 2024 2:08:56 AM To: mhoban/rainbow_bridge @.> Cc: Bird, Chris @.>; Author @.> Subject: Re: [mhoban/rainbow_bridge] Feature Request: Employ Standard Unix Input Convention for --reads (Issue #81)

Thanks for this! I'm in the process of working on some other updates but I will take a look at this and merge in your modifications once those are implemented. I appreciate that you have retained the existing functionality, since things were largely designed as they were for ease of use by less unix-savvy users (viz. grad students in the tobo lab)

— Reply to this email directly, view it on GitHubhttps://github.com/mhoban/rainbow_bridge/issues/81#issuecomment-2301296655, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ADBV4S73QCT2MIGKO4GA3TTZSQ4IRAVCNFSM6AAAAABMS4QVX2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMBRGI4TMNRVGU. You are receiving this because you authored the thread.Message ID: @.***>

mhoban commented 2 months ago

Thanks for you contribution, @cbird808! This is now all implemented in the branch 81-command-line and will soon be merged into main. I've also updated all the documentation. Major code changes can be found in 522ae01 and doc updates are in cbbf522.

I changed some stuff from your code but the functionality is all more or less the same. The pipeline should now support arbitrary file extensions through your updates, and it has been updated to look for '*.f*q\' if doing things the old way (so at least .fq, .fastq, .fq.gz, and .fastq.gz will work).

I'll also update the test repository in a branch of the same name (but it will probably be merged soon).

*notably your common-prefix code. Your implementation was very clever, but there was an issue whereby indexOf wasn't guaranteed to match the running index in the takeWhile closure. I updated it to use a simple for-loop and it resolved the problems I was having.

cbird808 commented 2 months ago

Great! Glad to contribute. In my experience, regardless of skill level, rainbow_bridge will be more widely adopted, and you'll receive fewer inquiries on how to use it, if at least the inputs work like most other packages.

From: mykle hoban @.> Sent: Friday, August 23, 2024 9:30 PM To: mhoban/rainbow_bridge @.> Cc: Bird, Chris @.>; Mention @.> Subject: Re: [mhoban/rainbow_bridge] Feature Request: Employ Standard Unix Input Convention for --reads (Issue #81)

Thanks for you contribution, @cbird808https://github.com/cbird808! This is now all implemented in the branch 81-command-linehttps://github.com/mhoban/rainbow_bridge/tree/81-command-line and will soon be merged into main. I've also updated all the documentation. Major code changes can be found in 522ae01https://github.com/mhoban/rainbow_bridge/commit/522ae0172a4076d1a7a5d0834b88fb26e7247a37 and doc updates are in cbbf522https://github.com/mhoban/rainbow_bridge/commit/cbbf522d66c6f42526112338492bd5774405fee8.

I changed some stuff from your code but the functionality is all more or less the same. The pipeline should now support arbitrary file extensions through your updates, and it has been updated to look for '.fq' if doing things the old way (so at least .fq, .fastq, .fq.gz, and .fastq.gz will work).

I'll also update the test repository in a branch of the same name (but it will probably be merged soon).

*notably your common-prefix code. Your implementation was very clever, but there was an issue whereby indexOf wasn't guaranteed to match the running index in the takeWhile closure. I updated it to use a simple for-loop and it resolved the problems I was having.

- Reply to this email directly, view it on GitHubhttps://github.com/mhoban/rainbow_bridge/issues/81#issuecomment-2308001361, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ADBV4S3ENIYA5OZWPIWLAADZS7V3LAVCNFSM6AAAAABMS4QVX2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMBYGAYDCMZWGE. You are receiving this because you were mentioned.Message ID: @.**@.>>

mhoban commented 2 months ago

All updates have been merged as of e557b4c395ddfe5f4e81a7b977b92e3114933fbd. Closing for now, but feel free to reopen if you encounter further issues.

mhoban / rainbow_bridge

Feature Request: Employ Standard Unix Input Convention for `--reads` #81