Ribofrac: Check for empty files and skip samples with zero reads

lennijusten commented 8 months ago

In past commits to main I added a feature to ribofrac() that counted the total reads in total_reads_dict and returned np.nan if the sample had zero reads. However, when running the new NAO data through ribofrac(), several samples returned np.nan even though they definitely have reads in them.

One possible cause of this issue is that the files were somehow empty or not copied over correctly from AWS. I've added a file_integrity_check() function that checks the potential inputs (excluding the ".settings" and ".discarded" files) to see 1) whether the path exists after copying down from AWS, and 2) whether the file contains any reads.

If none of the potential inputs exist or contain any reads, ribofrac() will skip the sample and not output anything to AWS. This seems better than outputting np.nan without being certain that the files were correctly processed, which could mislead people. It also leaves no existing, potentially incorrect, ribofrac entries behind, so the pipeline can simply be re-run.
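
For illustration, here is a minimal sketch of the check and the skip decision described above. It is not the code in this PR. The signature of file_integrity_check(), the helpers has_reads() and should_skip_sample(), and the assumption that inputs are FASTQ files which may be gzip-compressed are all hypothetical.

```python
import gzip
import os

# Assumption: these files are identified by a substring of the file name.
IGNORED_TAGS = (".settings", ".discarded")


def has_reads(path):
    """Return True if path exists and holds at least one non-blank line."""
    if not os.path.exists(path):
        return False
    # Assumption: names ending in .gz are gzip-compressed, others plain text.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        return any(line.strip() for line in f)


def file_integrity_check(paths):
    """Return the potential inputs that exist and contain reads.

    Hypothetical signature; ".settings" and ".discarded" files are ignored.
    """
    usable = []
    for path in paths:
        name = os.path.basename(path)
        if any(tag in name for tag in IGNORED_TAGS):
            continue
        if has_reads(path):
            usable.append(path)
    return usable


def should_skip_sample(paths):
    """Skip the sample when no potential input exists and contains reads."""
    return not file_integrity_check(paths)
```

With something like this, ribofrac() would return early for a sample where should_skip_sample() is true and write nothing for it, rather than recording np.nan.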

jeffkaufman commented 8 months ago

@lennijusten can you resolve conflicts with current main, and then ask me to review again?

jeffkaufman commented 8 months ago

(note that when it says I force-pushed that's actually you force pushing from a deploy key you asked me to add :( )

lennijusten commented 8 months ago

@jeffkaufman Fixed! Can you review?

naobservatory / mgs-pipeline

Ribofrac: Check for empty files and skip samples with zero reads #30