linsalrob / fastq-pair

Match up paired end fastq files quickly and efficiently.
https://edwards.flinders.edu.au/sorting-and-paring-fastq-files/
MIT License

Enhanced FastQ-Pair Tool with New Features and Improvements #23

Open f-huber opened 2 months ago

f-huber commented 2 months ago

Overview

This pull request adds gzip support, two optional processing features, and a revised output naming scheme to the fastq-pair tool, along with an updated test dataset and README.

Key Features and Changes:

  1. Added support for gzipped FastQ files:

    • The tool now reads and writes .gz-compressed FastQ files.
    • Users no longer need to decompress inputs and re-compress outputs, which reduces disk space usage and saves time on large datasets.
  2. Added option for entry deduplication:

    • Introduced an optional feature to remove duplicate entries within each file, identified by entry name.
  3. Added option for identifier reformatting:

    • Introduced an option to reduce each sequence identifier to its minimal form (the part before the first space), which improves compatibility with some downstream analysis tools.
  4. Changed the output filename routine:

    • Output filenames are now built from the input file's basename with its extension removed.
    • This provides clearer, more consistent file naming and avoids redundancy.
  5. Updated the test dataset:

    • The test dataset has been revised to cover gzipped files and the new options.
  6. Updated README:

    • Documentation has been updated to reflect the new options and features, providing clear instructions for users on how to utilize the new functionality.
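
To make the behaviour described in items 1 to 4 concrete, here is a minimal Python sketch. It is not the code in this pull request (fastq-pair itself is written in C), and every function name is hypothetical. It assumes four-line FastQ records, that duplicates are detected on the minimal identifier with the first occurrence kept, and that a trailing .gz is stripped before the remaining extension. The PR may differ on any of these points.

```python
import gzip
import os
from typing import Iterable, Iterator, TextIO, Tuple

# One FastQ record: header, sequence, separator line, quality string.
Record = Tuple[str, str, str, str]


def open_fastq(path: str, mode: str = "rt") -> TextIO:
    """Open a FastQ file as text, going through gzip if the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def read_records(handle: TextIO) -> Iterator[Record]:
    """Yield records, assuming exactly four lines per record (no wrapping)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the iteration
            return
        seq = handle.readline().rstrip("\n")
        sep = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, sep, qual


def minimal_id(header: str) -> str:
    """Return the header up to the first space: '@r1 1:N:0' -> '@r1'."""
    return header.split(" ", 1)[0]


def dedupe(records: Iterable[Record], reformat_ids: bool = False) -> Iterator[Record]:
    """Drop later records whose minimal identifier was already seen.

    Assumption: the first occurrence wins. With reformat_ids=True the
    emitted header is replaced by its minimal identifier.
    """
    seen = set()
    for header, seq, sep, qual in records:
        name = minimal_id(header)
        if name in seen:
            continue
        seen.add(name)
        yield (name if reformat_ids else header), seq, sep, qual


def output_stem(path: str) -> str:
    """Strip the directory, an optional .gz suffix, then one extension."""
    name = os.path.basename(path)
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    return os.path.splitext(name)[0]
```

Under these assumptions, `output_stem("/data/run/sample_R1.fastq.gz")` gives `sample_R1`, and `dedupe(read_records(open_fastq(path)), reformat_ids=True)` streams one file's unique records with shortened headers. Because the records are streamed rather than loaded at once, only the set of seen identifiers is held in memory.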

Rationale

These updates address requests from the community for better support of gzipped FastQ files. The improvements also streamline output file management and ensure users have up-to-date documentation and testing resources.

Testing

The changes have been tested using the updated dataset to ensure compatibility with both gzipped and uncompressed FastQ files, and to validate the deduplication and identifier reformatting features.