sourmash-bio / sourmash_plugin_directsketch

BSD 3-Clause "New" or "Revised" License
1 stars 1 forks source link

`gbsketch` does not report sequences failed to sketch if checksum fails #70

Closed ccbaumler closed 2 weeks ago

ccbaumler commented 3 months ago

Report files from updating genbank databases

https://farm.cse.ucdavis.edu/~baumlerc/report.20240712-archaea.html https://farm.cse.ucdavis.edu/~baumlerc/report.20240712-fungi.html https://farm.cse.ucdavis.edu/~baumlerc/report.20240712-protozoa.html https://farm.cse.ucdavis.edu/~baumlerc/report.20240712-viral.html

In the Genome report for the updated database section, I have reported the sourmash sig check output. Showing that a large number of sequences were not sketched into the final database.

In the next section Genomes failed to download or sketch, I have reported the file created by gbsketch --failed.

The error

I found that the --failed file does not report the invalid checksum files found in log file. The counts for missing and failed checksums is approximately the same.

Error: Invalid checksum line format in URL https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/023/535/875/GCA_023535875.1_ASM2353587v1/md5checksums.txt: <head>
Command Count Report Count
grep "Error" logs/gather_sketch_missing.20240712_viral.log \| wc -l 15816 15828
grep "Error" logs/gather_sketch_missing.20240712_protozoa.log \| wc -l 1001 1004
grep "Error" logs/gather_sketch_missing.20240712_archaea.log \| wc -l 866 869
grep "Error" logs/gather_sketch_missing.20240712_fungi.log \| wc -l 7593 7774

when the search term is changed to "checksum" it is the same count as "Error"

Appendix of code chunks

less /home/baumlerc/database-releases/genbank-workflow/logs/gather_sketch_missing.20240712_viral.log
grep "Error" /home/baumlerc/database-releases/genbank-workflow/logs/gather_sketch_missing.20240712_protozoa.log | wc -l
ccbaumler commented 3 months ago

Upon testing with a smaller set of samples, these same check sum issues do not seem to be present.

accession,name,ftp_path
GCA_000961135.2,GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-454,
GCA_000175535.1,GCA_000175535.1 Chlamydia muridarum MopnTet14 (agent of mouse pneumonitis) strain=MopnTet14,
GCA_023535875.1,GCA_023535875.1 Cowpox virus strain=HumKre08/1 ASM2353587v1,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/023/535/875/GCA_023535875.1_ASM2353587v1
GCA_023533615.1,GCA_023533615.1 Cowpox virus strain=Austria 1999 ASM2353361v1,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/023/533/615/GCA_023533615.1_ASM2353361v1
sourmash scripts gbsketch acc.csv -o test-gbsketch.zip -f out_fastas -k --failed test.failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1 2> test.log
ccbaumler commented 3 months ago

I have extracted all the failed accessions via this script:

#!/bin/bash

# Input parameters
OUTPUT_PATH="$1"
DATE="$2"
DOMAIN="$3"

# Check if DATE and DOMAIN are provided
if [ -z "$OUTPUT_PATH" ] || [ -z "$DATE" ] || [ -z "$DOMAIN" ]; then
    echo "Usage: $0 OUTPUT_PATH DATE DOMAIN"
    exit 1
fi

# Define file paths based on inputs
MISSING="${OUTPUT_PATH}/data/missing-genomes.${DATE}-${DOMAIN}.csv"
REVERSION="${OUTPUT_PATH}/data/updated-versions.${DATE}-${DOMAIN}.csv"
CHECK="${OUTPUT_PATH}/data/genbank-${DATE}-${DOMAIN}.missing.csv"
OUTPUT="${OUTPUT_PATH}/workflow-cleanup/manual-download.${DATE}-${DOMAIN}.csv"
MANUAL="${OUTPUT_PATH}/workflow-cleanup/manual-check.${DATE}-${DOMAIN}.csv"

# Create the output file with header
echo -e "accession,name,ftp_path" > "$OUTPUT"

# Process the missing genomes and update the output file
awk -F, 'NR>1 && FNR==NR { a[$1]=$0; next } NR>1 && $1 in a { print a[$1] }' "$MISSING" "$CHECK" >> "$OUTPUT"

# Append updated genomes to the output file
awk -F, 'NR>1 && FNR==NR { a[$1]=$0; next } NR>1 && $1 in a { print a[$1] }' "$REVERSION" "$CHECK" >> "$OUTPUT"

# Find genomes missed in the output file compared to the check file
awk -F, 'NR>1 && FNR==NR { a[$1]=$0; next } FNR>1 && !($1 in a) { print $0 }' "$OUTPUT" "$CHECK" > "$MANUAL"

CLI:

$ ./scripts/gather_failed.sh /group/ctbrowngrp4/2024-ccbaumler-genbank/genbank-20240712 20240712 archaea
$ ./scripts/gather_failed.sh /group/ctbrowngrp4/2024-ccbaumler-genbank/genbank-20240712 20240712 fungi
$ ./scripts/gather_failed.sh /group/ctbrowngrp4/2024-ccbaumler-genbank/genbank-20240712 20240712 protozoa
$ ./scripts/gather_failed.sh /group/ctbrowngrp4/2024-ccbaumler-genbank/genbank-20240712 20240712 viral

The files for these are stored at /group/ctbrowngrp4/2024-ccbaumler-genbank/genbank-20240712/workflow-cleanup

I am going to run gbsketch from these accessions and see if it works outside of a snakemake workflow for some reason.

ccbaumler commented 3 months ago

Each of these commands ran and created a database without any errors. I'll append them on the existing database now.

ctb commented 1 month ago

@ccbaumler and I looked at directsketch.rs and the problem seems to be that the method dl_sketch_assembly_accession returns an Err if the checksums file is a bad format (probably due to too high a request load, or something else). This means we just get an error in the output log but not something nice and machine readable.

If we want to make this accessible to colton-esque people to rerun failed checksums, we should produce yet another file output that has the links that have failed to produce a good checksum file. A simple way to do this would be to change dl_sketch_assembly_accession to return a 3-tuple rather than a 2-tuple, with a boolean that says if checksum failed. Then we could modify gbsketch to output a specialized log file containing checksums that have failed.