oma219 / spumoni

Pan-Genomic Matching Statistics
https://doi.org/10.1016/j.isci.2021.102696
GNU General Public License v3.0

Pseudo Length File not made #3

Closed: meerakulous closed this issue 2 years ago

meerakulous commented 2 years ago

I ran spumoni with this command: ./spumoni run -r <ref_file> -p <read_file> -P -f. None of the output files ended with "pseudo_lengths". Which files should I feed into analyze_pml.py?

oma219 commented 2 years ago

Hello,

Could you share the output of spumoni when you ran this command: spumoni run -r <ref_file> -p <read_file> -P -f

The output files should be in the same directory as the read file.

This README describes the input for the analyze_pml.py script. It essentially requires two different *.pseudo_lengths files to compare.
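For intuition only, here is a rough sketch of that kind of comparison, assuming the FASTA-like layout of a *.pseudo_lengths file mentioned later in this thread (a >read_name header line followed by one line of whitespace-separated pseudo-matching lengths per read). The file names and helper functions are hypothetical; this is not the actual analyze_pml.py code:

```python
# Hypothetical sketch, not the actual analyze_pml.py source: compare the
# pseudo-matching lengths (PMLs) stored in two *.pseudo_lengths files,
# assuming each read's results span two lines in a FASTA-like layout:
#   >read_name
#   pml_1 pml_2 ... pml_L   (whitespace-separated integers)

def read_pmls(path):
    """Yield (read_name, list_of_pmls) pairs from one *.pseudo_lengths file."""
    with open(path) as fh:
        name = None
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:]
            elif name is not None:
                yield name, [int(x) for x in line.split()]
                name = None

def mean_pml(path):
    """Average PML over all positions of all reads in the file."""
    total = count = 0
    for _, pmls in read_pmls(path):
        total += sum(pmls)
        count += len(pmls)
    return total / count if count else 0.0

# Reads that truly come from the indexed reference should show noticeably
# larger PMLs than the same reads scored against a null/other reference:
# print(mean_pml("reads_vs_target.pseudo_lengths"),
#       mean_pml("reads_vs_null.pseudo_lengths"))
```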

meerakulous commented 2 years ago

Here are all the files generated when I run spumoni with my reference file pangenomehuman.fasta

[Screenshot: directory listing of the files generated alongside pangenomehuman.fasta]
oma219 commented 2 years ago

Okay, that looks like the directory where all the index files are stored; most of them are temporary files. In the directory that contains your reads file, you should see a file that ends with *.pseudo_lengths.

Typically, I put the reference file in one folder and the reads file in another. It seems like that might be what you did as well, since I don't see any files in that screenshot that appear to be the reads.

meerakulous commented 2 years ago

Sorry for the late reply -- I can now generate the pseudo length file. Now I'm having an issue with calculating the matching statistics: each of my pseudo length files is 62 GB, and I'm not able to run analyze_pml.py on my server with 225 GB of RAM.

oma219 commented 2 years ago

I see. Is there a particular error message that makes you think you cannot run analyze_pml.py?

One way around this for now is to take a small portion of the *.pseudo_lengths file by running something like head -n 2000 *.pseudo_lengths, which extracts the results for the first 1000 reads, since each read's results consist of two lines. Then you can run analyze_pml.py on that smaller file. It doesn't have to be 1000 reads; that is just an example.
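If you'd rather subsample by read count than by line count, a minimal streaming sketch like the one below does the same thing without ever holding the 62 GB file in memory. It assumes the two-lines-per-read layout described above, and the file names are just placeholders:

```python
# Sketch (placeholder file names): copy the first N reads from a large
# *.pseudo_lengths file without loading it into memory, assuming each
# read's results span exactly two lines (header line + PML line).
import itertools

N_READS = 1000  # number of reads to keep; adjust as needed

with open("reads.fa.pseudo_lengths") as src, \
     open("reads.subset.pseudo_lengths", "w") as dst:
    # Each read = 2 lines, so stream out the first 2 * N_READS lines.
    for line in itertools.islice(src, 2 * N_READS):
        dst.write(line)
```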

In the coming weeks, I plan on integrating analyze_pml.py into the main SPUMONI code, as well as some additional code that will make those *.pseudo_lengths files smaller.

meerakulous commented 2 years ago

Taking a small portion of the pseudo_lengths file works! I think it's just the size of my data that is causing my analyze_pml job to get killed.

oma219 commented 2 years ago

That is great, I'll close the issue then. Like I mentioned above, I hope to make some commits in the coming weeks to make the process a little more streamlined and less memory-intensive for large datasets.