stjude-rust-labs / fq

Command line utility for manipulating Illumina-generated FASTQ files.
MIT License

`NamesValidator` issue #34

Closed bounlu closed 11 months ago

bounlu commented 11 months ago

Hi,

I get an unexpected error for NamesValidator:

$ fq lint Auto_C1_1_val_1.fq.gz_unmapped_reads_1.fq.gz Auto_C1_2_val_2.fq.gz_unmapped_reads_2.fq.gz
2023-12-07T03:16:57.574994Z  INFO fq::commands::lint: fq-lint start
2023-12-07T03:16:57.575069Z  INFO fq::commands::lint: validating paired end reads
2023-12-07T03:16:57.575108Z  INFO fq::validators: disabled validators: []
2023-12-07T03:16:57.575116Z  INFO fq::validators: enabled single read validators: ["[S003] NameValidator", "[S004] CompleteValidator", "[S002] AlphabetValidator", "[S001] PlusLineValidator", "[S005] ConsistentSeqQualValidator", "[S006] QualityStringValidator"]
2023-12-07T03:16:57.575134Z  INFO fq::validators: enabled paired read validators: ["[P001] NamesValidator"]
2023-12-07T03:16:57.575149Z  INFO fq::commands::lint: enabled special validators: ["[S007] DuplicateNameValidator"]
2023-12-07T03:16:57.575154Z  INFO fq::commands::lint: starting validation (pass 1)
Auto_C1_1_val_1.fq.gz_unmapped_reads_1.fq.gz:1:1: [P001] NamesValidator: names mismatch: expected '@A01726:43:HTWT7DSX7:3:1101:12735:1000_2:N:0:ACACGGTT+TGGTTCGA', got '@A01726:43:HTWT7DSX7:3:1101:12735:1000_1:N:0:ACACGGTT+TGGTTCGA'

This happens because the underscore _ in the read name is not recognized as a separator. Bismark adds it during processing, as explained here. The validator parses the read name up to the first space, but in this case the space has been replaced with an underscore, so the names don't match between R1 and R2.

I believe the code needs to be fixed to handle such cases.
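To illustrate the mismatch described above, here is a minimal sketch in Rust. It is not fq's actual implementation: `name_prefix` is a hypothetical helper that assumes the validator compares only the part of each name before the first separator character.

```rust
/// Hypothetical helper (not fq's real code): returns the part of a read name
/// that precedes the first occurrence of any of the given separator characters.
fn name_prefix<'a>(name: &'a str, separators: &[char]) -> &'a str {
    match name.find(|c: char| separators.contains(&c)) {
        Some(i) => &name[..i],
        None => name, // no separator found: the whole line is the name
    }
}

fn main() {
    // Read names taken from the error message above.
    let r1 = "@A01726:43:HTWT7DSX7:3:1101:12735:1000_1:N:0:ACACGGTT+TGGTTCGA";
    let r2 = "@A01726:43:HTWT7DSX7:3:1101:12735:1000_2:N:0:ACACGGTT+TGGTTCGA";

    // With '/' and ' ' as separators, nothing is split off, so the full
    // lines (which differ in the _1 / _2 part) are compared.
    let defaults = ['/', ' '];
    println!("{}", name_prefix(r1, &defaults) == name_prefix(r2, &defaults)); // false

    // Treating '_' as the separator leaves the same prefix on both reads.
    let underscore = ['_'];
    println!("{}", name_prefix(r1, &underscore) == name_prefix(r2, &underscore)); // true
}
```

Under these assumptions, the names only compare equal once the underscore is treated as a separator.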

zaeleus commented 11 months ago

Thanks for reporting, @bounlu.

I wasn't aware underscores were being used as a definition separator. 690481e6d7db0e01c782779c4c2cb246720ba2b8 adds a --record-definition-separator option to the lint command to override the default separators (/ and (space)). E.g., in your command:

$ fq lint \
    --record-definition-separator _ \
    Auto_C1_1_val_1.fq.gz_unmapped_reads_1.fq.gz \
    Auto_C1_2_val_2.fq.gz_unmapped_reads_2.fq.gz

bounlu commented 11 months ago

Thanks for the quick fix, but it does not generalize. I want to apply the same code to all sorts of FASTQ files without specifying the separator. Otherwise, users need extra code on their side to detect the separator from the input file.

Can we adapt this to check the first line of the FASTQ file and determine the separator automatically?

zaeleus commented 10 months ago

The option as currently implemented seems to be the appropriate solution, as there is no way to determine the separator with a heuristic. For example, "sq 1|extra" is ambiguous.

FASTQ does not have a standard specification, and whitespace is the de facto definition separator. We also include the forward slash (/) because fq was originally built strictly for Illumina read names.
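The ambiguity of "sq 1|extra" can be made concrete with a small Rust sketch. This is only an illustration, not part of fq: `candidate_names` is a hypothetical function that assumes any non-alphanumeric character could be the intended separator.

```rust
/// Hypothetical illustration (not part of fq): treats every non-alphanumeric
/// character in a definition line as a possible separator and returns the
/// name each choice would produce.
fn candidate_names(line: &str) -> Vec<(char, &str)> {
    line.char_indices()
        .filter(|(_, c)| !c.is_alphanumeric())
        .map(|(i, c)| (c, &line[..i]))
        .collect()
}

fn main() {
    // The example from the comment above: is the name "sq" or "sq 1"?
    for (sep, name) in candidate_names("sq 1|extra") {
        println!("separator {:?} -> name {:?}", sep, name);
    }
    // separator ' ' -> name "sq"
    // separator '|' -> name "sq 1"
}
```

Both results are plausible read names, and nothing in the line itself says which one was intended, so the choice has to come from the user.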