naobservatory / mgs-workflow

Options for allowing more diverse read names #24

Closed mikemc closed 5 days ago

mikemc commented 1 week ago

Some options

Allow for additional common patterns

e.g. patterns output by various demultiplexing software, SRA downloads, etc.

Require the user to specify in the libraries.csv file

It is rare for bioinformatics tools to require read files to follow a specific naming structure; instead, they ask the user to specify the read file names explicitly. We could easily achieve this by adding columns for the forward and reverse read file names to the libraries.csv file.

This second option currently seems best to me since it is maximally general, leaves the received data unaltered, and provides a useful check that the specified names match the files actually present in the raw/ folder.
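To illustrate the check this option would enable, here is a minimal Python sketch. The column names `library`, `fastq_1`, and `fastq_2` are hypothetical (the actual libraries.csv layout may differ), and this is not code from the workflow itself.

```python
import csv
from pathlib import Path


def check_libraries(csv_path, raw_dir, fwd_col="fastq_1", rev_col="fastq_2"):
    """Return (library, file name) pairs for read files not found in raw_dir.

    The column names "library", "fastq_1", and "fastq_2" are assumptions
    made for this sketch, not the workflow's real schema.
    """
    raw = Path(raw_dir)
    missing = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            for col in (fwd_col, rev_col):
                name = row[col]
                # is_file() follows symlinks, so a symlink to an existing
                # file passes and a broken symlink is reported as missing.
                if not (raw / name).is_file():
                    missing.append((row["library"], name))
    return missing
```

An empty return value would mean every file named in libraries.csv is present in raw/; anything else could be reported to the user before the pipeline starts.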

mikemc commented 1 week ago

Another way we can better support different read names is to allow the read files used by the process 'CONCAT_GZIPPED' to be symbolic links. Rather than renaming the raw files, I would prefer to just create symlinks with correctly formatted names (so as to preserve the raw data and avoid unneeded file duplication). But currently 'CONCAT_GZIPPED' fails if the files in the raw/ folder are symlinks, e.g.

Command error:
  Raw files directory: raw
  Sample: DSI_dplt_110123
  Libraries: DSI_dplt_110123
  Forward read files: 1
  Reverse read files: 1
  Read 1 files to concatenate: raw/DSI_dplt_110123_S6_L001_1.fastq.gz
  Read 2 files to concatenate: raw/DSI_dplt_110123_S6_L001_2.fastq.gz
  Only one file per read pair; copying.
  cp: cannot stat 'raw/DSI_dplt_110123_S6_L001_1.fastq.gz': No such file or directory

To support symlinks, I think the only change needed is to add the -L flag to the copying step:

      cp -L ${r1} ${out1}
      cp -L ${r2} ${out2}

It seems that the subsequent test step using cmp will already compare the file the symlink points to (rather than the symlink itself).

Edit: The issue seems to be something different. Both cp and cp -L seem to correctly copy the file the symlink points to. When I run the .command.sh script in the work directory after a failure, it works fine. So I think there's some sort of interaction with nextflow and/or AWS that I don't understand.
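One possible way to narrow this down, offered as a sketch rather than something tested in the pipeline: cp reports "No such file or directory" both when a path is absent and when it is a symlink whose target cannot be resolved, so it may help to tell those cases apart from inside the task. The helper name `describe_path` is hypothetical.

```python
import os


def describe_path(path):
    """Classify a path as 'missing', 'symlink', 'broken symlink', or 'not a symlink'."""
    # lexists() does not follow symlinks, so it is True even for a broken link.
    if not os.path.lexists(path):
        return "missing"
    if os.path.islink(path):
        # exists() follows symlinks, so it is False when the target is absent.
        return "symlink" if os.path.exists(path) else "broken symlink"
    return "not a symlink"
```

For example, calling `describe_path("raw/DSI_dplt_110123_S6_L001_1.fastq.gz")` from the environment where the process runs would show whether the link itself is absent there or whether only its target is unreachable.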

willbradshaw commented 5 days ago

Closing this as duplicative with #16; have linked to this issue from that one for visibility of @mikemc's comments.