giovannarosone / BCR_LCP_GSA

Multi-string eBWT/LCP/GSA computation
BSD 2-Clause "Simplified" License

BWT for a large, multisample fasta file #5

Open robertwhbaldwin opened 2 years ago

robertwhbaldwin commented 2 years ago

Hi,

I've been having problems getting BCR to finish. I have a large (~450 GB) multisample FASTA file. I'm running BCR with default settings, and it works on a smaller input FASTA file. On the big file, though, it runs for a few days and then stops without any error message. The log file is below. Any advice would be appreciated. Thanks - Robert

BWTCollection: The input is ./merged.fasta
BWTCollection: The output is ./merged_out
Compute the EBWT
The output format of BCR is at most 4 files (ebwt, lcp, da, posSA) at the same time.
BCR uses the external memory for the BWT partial
BCR of multi-sequences
Lexicographic order
dataTypedimAlpha: sizeof(type size of alpha): 1 bytes
dataTypelenSeq: sizeof(type of seq length): 1 bytes
dataTypeNSeq: sizeof(type of #sequences): 4 bytes
dataTypeNChar: sizeof(type of #sequences): 8 bytes
TIMER start buildBCR User: 0s System: 2.3e-05s Actual: 2.4e-05s Efficiency: 95.8333%

Builds cyc. files and the builds the BCR
The (new and-or old) reads have a different length.
Number of sequences reading/writing: 1951361762
Number of characters reading/writing: 287390043283
In the new collection, we have:
TrasposeFasta: init buf_ for bases of size 151 * 219689734
TransposeFasta: The max length (Read) is: 151
TransposeFasta: Number of reads: 1951361762
TransposeFasta: Total Number of chars (without end-markers): 287390043283
TransposeFasta: Size Alpha: 5 symbols
TIMER after TRASP. User: 2e-06s System: 0s Actual: 3e-06s Efficiency: 66.6667%
Symbols in the input file (ASCII, char, freq, code):
35 # 1951361762 0
65 A 80887928969 1
67 C 63212786497 2
71 G 61669656023 3
84 T 81619671794 4

Start Preprocessing 1633779567 seconds
End Preprocessing 1633788058 seconds
Preprocessing tooks 8491 seconds

SizeAlpha: 5
Length of the longest sequence: 151

Number of sequences: 1951361762
Total symbols (without end-markers): 287390043283
Total symbols (with end-markers) in ebwt: 289341405045

Total (max) RAM for BCR for computing eBWT (int/ext) and/or LCP and/or DA and/or SA (including the buffers for reading files): 61411 MebiByte (MiB)

giovannarosone commented 2 years ago

Hi Robert, thanks for writing to me. From the BCR log I see that you only want to build the EBWT. The file contains 1,951,361,762 sequences of maximum length 151 over the alphabet {A, C, G, T}. So at the end of the first step you should have 151 files prefixed with cyc (you can also remove them using the appropriate script). Each cyc file should have a length equal to the number of sequences. Is that right?

Do you have enough RAM to hold the necessary information for each string? Do you have enough disk space?

Best, Giovanna
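
The expectation stated in this reply (one cyc file per sequence position, each as long as the number of sequences) can be turned into a rough disk-space estimate. The following is a minimal sketch, not part of BCR: the function name `cyc_footprint` is hypothetical, and it assumes one byte per symbol, which matches the 1-byte alphabet type reported in the log but is otherwise an assumption. It covers only the cyc files, not the partial BWT files or other intermediate output.

```python
def cyc_footprint(num_sequences: int, max_length: int, bytes_per_symbol: int = 1) -> dict:
    """Estimate the disk space needed by the cyc files alone.

    Assumption (hypothetical helper, not BCR code): one cyc file per
    sequence position, each holding one symbol per sequence.
    """
    bytes_per_file = num_sequences * bytes_per_symbol
    return {
        "files": max_length,
        "bytes_per_file": bytes_per_file,
        "total_bytes": bytes_per_file * max_length,
    }


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# Values taken from the BCR log earlier in this thread.
estimate = cyc_footprint(num_sequences=1_951_361_762, max_length=151)
print(f"expected cyc files : {estimate['files']}")
print(f"size per cyc file  : {to_gib(estimate['bytes_per_file']):.2f} GiB")
print(f"total for cyc files: {to_gib(estimate['total_bytes']):.2f} GiB")
```

Comparing the number and size of the cyc files actually present on disk against this estimate shows whether the first step ran to completion.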

robertwhbaldwin commented 2 years ago

Thank you for the response. I have 64 GB of RAM and ~600 GB of storage for the run. Do you think this is a storage/RAM issue?

robertwhbaldwin commented 2 years ago

I only counted 84 cyc files. Each one was 1.9 G. After it fails I still have lots of space on my drive:

/dev/sda 916G 674G 196G 78% /data

But the egap log file had this message:

"Using 60827 MBs of RAM
Sending logging messages to file: merged.fasta.eGap.log
==== gSACAK
Command: /home/robert/tools/egap/tools/gsacak-64 -b -m 60827 merged.fasta 0
Error executing command line: /home/robert/tools/egap/tools/gsacak-64 -b -m 60827 merged.fasta 0
Check log file: merged.fasta.eGap.log"

And there is no merged.fasta.eGap.log.

Here's a screenshot of what's in the folder:

[Image: Screenshot from 2021-10-18 10-59-53]

giovannarosone commented 2 years ago

Something went wrong:

Could you tell me the length of the longest sequence in the input file? 83 or 150?

Could you run BCR on the original file (before the merge)? What happens?
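
One way to answer the question about the longest sequence is to scan the FASTA file directly. The sketch below is an illustration only and is not part of BCR; the function names `sequence_lengths` and `summarize` are hypothetical. It assumes a plain-text FASTA file in which every record starts with a `>` header line and the sequence may be wrapped over several lines. On a file of several hundred gigabytes a pure-Python scan will be slow, so it may be more practical to run it on each original file before merging.

```python
import sys
from collections import Counter


def sequence_lengths(lines):
    """Yield the length of each FASTA record, joining wrapped sequence lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line.strip())
    if length is not None:
        yield length


def summarize(lines):
    """Return (longest length, Counter mapping length -> number of records)."""
    counts = Counter(sequence_lengths(lines))
    longest = max(counts) if counts else 0
    return longest, counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python fasta_lengths.py merged.fasta
    with open(sys.argv[1]) as handle:
        longest, counts = summarize(handle)
    print(f"longest sequence: {longest}")
    for length, n in sorted(counts.items()):
        print(f"{length}\t{n}")
```

The per-length counts also show whether the collection mixes reads of different lengths, which is what the BCR log reports for this input.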