vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

DIANN truncating analysis.tdf? #1166

Closed jsnedeco closed 2 months ago

jsnedeco commented 2 months ago

I'm running DIANN 1.8 through Wine and we're running into a strange issue: files that have been manually uploaded onto a Windows EC2 instance produce different results than the same files uploaded to S3 and then downloaded later. Even when running the sample by itself, the manually uploaded file generates around 8000 precursor hits, while the file downloaded from S3 only gets around 4000.

We are running DIA-PASEF data from a Bruker timsTOF, and the MD5 hashes for the downloaded analysis.tdf and analysis.tdf_bin both match. However, after running DIANN on the S3-downloaded files, the analysis.tdf file is truncated to 32MB and the MD5 hashes no longer match:

Before DIANN image

After DIANN image

I can send you the raw .d file if that would help, along with logs for the local and S3-downloaded runs. I feel like there has to be an issue with the S3 transfer, but I can't figure out why the MD5 hashes for the analysis files would match in that case, and I REALLY don't understand why DIANN is truncating the file after running.
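
For anyone trying to reproduce this, the hash comparison above only covered analysis.tdf and analysis.tdf_bin. A minimal Python sketch that hashes every file in two copies of a .d folder and reports files present on only one side could look like the following. The function names and the folder paths in the usage note are hypothetical, not part of DIA-NN or any Bruker tooling.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Map each file path (relative to root) to its MD5 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_trees(left, right):
    """Return (only_in_left, only_in_right, changed) as sorted lists of paths."""
    a, b = hash_tree(left), hash_tree(right)
    only_left = sorted(set(a) - set(b))
    only_right = sorted(set(b) - set(a))
    changed = sorted(name for name in set(a) & set(b) if a[name] != b[name])
    return only_left, only_right, changed
```

Calling `diff_trees("local/sample.d", "s3/sample.d")` (placeholder paths) before running DIANN would list any extra files in the S3 copy even when the two main analysis files hash identically.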

jsnedeco commented 2 months ago

I think I figured it out. If the sample is being accessed while it's copied to S3 (possibly by DIA-NN?), an analysis.tdf-wal file gets created. This is the write-ahead log that SQLite uses to track transactions against the original database, but since we're copying while the analysis.tdf database is being accessed, the copied -wal file probably doesn't reflect the full database. Then, when DIA-NN is run, SQLite syncs those changes back into analysis.tdf. Rerunning after deleting the -wal file seems to address the problem.
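
The workaround above can be sketched in Python as a check run on the downloaded .d folder before starting DIA-NN. This is only an illustration under assumptions: the helper names are hypothetical, and it assumes the sidecar files follow SQLite's usual naming (the database name plus `-wal`, and the companion `-shm` shared-memory file). Deleting a -wal file discards whatever it contains, so this only makes sense when the main database is already known to be intact, as the matching MD5 hashes suggested here.

```python
from pathlib import Path

# Suffixes SQLite appends to the database file name in write-ahead-log mode.
SIDECAR_SUFFIXES = ("-wal", "-shm")


def find_sidecars(d_folder, db_name="analysis.tdf"):
    """Return the SQLite sidecar files that exist next to the database."""
    folder = Path(d_folder)
    candidates = [folder / (db_name + suffix) for suffix in SIDECAR_SUFFIXES]
    return [path for path in candidates if path.is_file()]


def remove_sidecars(d_folder, db_name="analysis.tdf", dry_run=True):
    """List sidecar files and delete them unless dry_run is True."""
    found = find_sidecars(d_folder, db_name)
    if not dry_run:
        for path in found:
            path.unlink()
    return found
```

Running `remove_sidecars("sample.d")` (placeholder path) with the default `dry_run=True` only reports what would be deleted; pass `dry_run=False` to actually remove the files. Avoiding the copy while the folder is still open in another program would presumably prevent the stray file from being uploaded at all.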