rhysnewell / aviary

A hybrid assembly and MAG recovery pipeline (and more!)
GNU General Public License v3.0
76 stars 11 forks source link

convert rule spades_assembly to a python script #201

Closed YibiChen closed 2 months ago

AroneyS commented 2 months ago

Do the integration tests all succeed?

YibiChen commented 2 months ago

Yes, the integration tests all succeed except one. This is relevant to test_batch_recovery, where long read file is given, but long read type is "none". In my script (aviary/modules/assembly/scripts/spades_assembly.py 63-65), I raised a ValueError for this scenario whereas in the earlier rule this would be defaulted to pacbio. Is there any reason to set the default type as pacbio? I believe nanopore is more common.

YibiChen commented 2 months ago

I also have another concern about the tmpdir. In test_short_read_recovery_queue_submission, the tmpdir I got from qsub-logs is still /tmp, which is the $TMPDIR in the local env where I ran aviary. Is this correct? If I were to run bin chicken, would the aviary command be running on the remote node?

rhysnewell commented 2 months ago

Yes, the integration tests all succeed except one. This is relevant to test_batch_recovery, where long read file is given, but long read type is "none". In my script (aviary/modules/assembly/scripts/spades_assembly.py 63-65), I raised a ValueError for this scenario whereas in the earlier rule this would be defaulted to pacbio. Is there any reason to set the default type as pacbio? I believe nanopore is more common.

This doesn't seem to make a whole lot of sense. The default long read type in the CLI is ont, none is not a valid option. You shouldn't be supplying "none" as a long read type in the first place. If you are asking why pacbio was the final branch in the if statement originally, then it's because there are 4 different pacbio read types and only 2 ont types so it's programatically easier to catch the first 2 ont types and then safely assume pacbio after that.

If you don't have long reads, then this rule shouldn't even be getting run

YibiChen commented 2 months ago

Yes, the integration tests all succeed except one. This is relevant to test_batch_recovery, where long read file is given, but long read type is "none". In my script (aviary/modules/assembly/scripts/spades_assembly.py 63-65), I raised a ValueError for this scenario whereas in the earlier rule this would be defaulted to pacbio. Is there any reason to set the default type as pacbio? I believe nanopore is more common.

This doesn't seem to make a whole lot of sense. The default long read type in the CLI is ont, none is not a valid option. You shouldn't be supplying "none" as a long read type in the first place. If you are asking why pacbio was the final branch in the if statement originally, then it's because there are 4 different pacbio read types and only 2 ont types so it's programatically easier to catch the first 2 ont types and then safely assume pacbio after that.

If you don't have long reads, then this rule shouldn't even be getting run

I am sorry I didn't explain this well. I just thought it would be nice to have more sanity check. A good example is the test case for batch submission (see test/data/example_batch.tsv). Although long read type is "ont" for both samples here, it becomes "none" in the individual config.yaml files. This is likely causing by an index mistake somewhere. More importantly, the command would finish without any complains, even though spade were running in pacbio mode not nanopore. This could also happen if the user misspelled "ont" or used uppercase.

I can resubmit the pull request without the sanity check if you think it's not nessesary or I can correct the current version with the correct pacbio codes. And thanks for all the comments, they are very helpful.

AroneyS commented 2 months ago

The choices for long-read-type are restricted by the cli (https://github.com/rhysnewell/aviary/blob/78321260f1f1dd8e527079ff23b3404ce7d55cae/aviary/aviary.py#L481), so typos will error immediately.

rhysnewell commented 2 months ago

The choices for long-read-type are restricted by the cli (

https://github.com/rhysnewell/aviary/blob/78321260f1f1dd8e527079ff23b3404ce7d55cae/aviary/aviary.py#L481 ), so typos will error immediately.

I think I understand what Yibi is saying, but I don't exactly understand how the long_read_type became none. The batch input doesn't have the same safety check in place that the CLI does, users could potentially introduce a typo in their batch file if the user isn't writing the batch scripts to a file

YibiChen commented 2 months ago

But batch mode doesn't seem to have this restriction.

YibiChen commented 2 months ago

@rhysnewell maybe you can try to run the test case and figure it out?

AroneyS commented 2 months ago

My vote would be to add checks for batch mode, rather than worry at each stage if the variable is garbage. Though in a separate PR

rhysnewell commented 2 months ago

@YibiChen I've pinpointed the issue here and it's an easy fix, but it will go in a separate PR. Just remove your pacbio check from your script and respond to the other comments, thanks

YibiChen commented 2 months ago

The integration tests all succeed. Let me know if you spot anything else.