psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

AIRR output with paired-loci #311

Closed elijgarcia closed 3 years ago

elijgarcia commented 3 years ago

I am having issues outputting AIRR-formatted yamls when running partis partition on BCR sequences that were made with the 10x pipeline. I am able to run the partition just fine using --paired-loci and --paired-outdir. However, when I specify --airr-output, it raises an exception saying that I must set --outfname, but that's not possible when using the --paired-loci argument. I am trying to reformat the output so I can use it for the olmsted project. Are there alternative ways to reformat paired-loci data into the AIRR format?

An example of the code I'm running:

partis partition --infname ./path-to/filtered_contig_123.fasta --paired-loci --species mouse --airr-output --paired-outdir ./path-to/out123-1

Which will then raise the Exception: have to set --outfname if --airr-output is set

psathyrella commented 3 years ago

wow, you're using paired loci, that's awesome! If you see anything (else...) weird at all, do please open an issue -- it's stable enough that nothing big will change in terms of behavior, but it is still a bit bleeding edge.

As to your issue, yep I just don't have --airr-output in testing, so hadn't noticed that I needed to update it to use --paired-outdir rather than --outfname. That's a quick fix I should get to today, I just need to clean up something first.
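For anyone following along, here is a rough sketch of the kind of argument check involved. It is not the actual partis code: the function names and the paired-loci error message are made up for illustration, and only the first exception message and the four option names come from this thread.

```python
import argparse


def check_airr_output_args(args):
    """Hypothetical re-creation of the check, NOT the real processargs.py code."""
    if not args.airr_output:
        return
    if args.paired_loci:
        # With --paired-loci, output goes to --paired-outdir, so --outfname is not needed.
        if args.paired_outdir is None:
            # This message is invented for the sketch.
            raise Exception('have to set --paired-outdir if --airr-output and --paired-loci are set')
    elif args.outfname is None:
        # This is the message quoted in the original report.
        raise Exception('have to set --outfname if --airr-output is set')


def parse(argv):
    """Parse only the four options discussed in this issue, then validate them."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--airr-output', action='store_true')
    parser.add_argument('--paired-loci', action='store_true')
    parser.add_argument('--paired-outdir')
    parser.add_argument('--outfname')
    args = parser.parse_args(argv)
    check_airr_output_args(args)
    return args
```

With a check shaped like this, `--airr-output --paired-loci --paired-outdir <dir>` passes, while `--airr-output` on its own still raises the original exception.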

Also, that's great that you're working on reformatting output for olmsted -- that's been on our todo list for far too long, so we'd definitely be interested in incorporating your changes if you'd submit pull requests when you're done.

psathyrella commented 3 years ago

ok this should do it.

elijgarcia commented 3 years ago

It seems to be working quite well, we often use 10x sequencing on our sorting of PBMCs so it's great that you added that feature. Thank you for your speedy response and fix!

Oh I'm simply trying to use the tools that those creators made! It's quite amazing what they have done.

elijgarcia commented 3 years ago

I am still getting the same error:

Traceback (most recent call last):
  File "/opt/applications/partis/0.16.0/gnu/bin/partis", line 1066, in <module>
    processargs.process(args)
  File "/opt/applications/partis/0.16.0/gnu/python/processargs.py", line 250, in process
    raise Exception('have to set --outfname if --airr-output is set')
Exception: have to set --outfname if --airr-output is set
Traceback (most recent call last):
  File "/opt/applications/partis/0.16.0/gnu/bin/partis", line 1070, in <module>
    args.func(args)
  File "/opt/applications/partis/0.16.0/gnu/bin/partis", line 260, in run_partitiondriver
    run_all_loci(args)
  File "/opt/applications/partis/0.16.0/gnu/bin/partis", line 749, in run_all_loci
    run_step('cache-parameters', ltmp, auto_cache=True, skip_missing_input=True)
  File "/opt/applications/partis/0.16.0/gnu/bin/partis", line 507, in run_step
    utils.simplerun(' '.join(prep_args(ltmp)), dryrun=args.dry_run)
  File "/opt/applications/partis/0.16.0/gnu/python/utils.py", line 3458, in simplerun
    subprocess.check_call(cmd_str if shell else cmd_str.split(), env=os.environ, shell=shell)
  File "/opt/applications/python/2.7.11/gnu/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['/opt/applications/partis/0.16.0/gnu/bin/partis', 'cache-parameters', '--locus', 'igh', '--infname', './malaria/mouse_nterm_klh_2021/out2home-3/out126-airr/igh.fa', '--species', 'mouse', '--airr-output', '--parameter-dir', './malaria/mouse_nterm_klh_2021/out2home-3/out126-airr/parameters/igh', '--input-metafname', './malaria/mouse_nterm_klh_2021/out2home-3/out126-airr/meta.yaml', '--sw-cachefname', './malaria/mouse_nterm_klh_2021/out2home-3/out126-airr/parameters/igh/sw-cache.yaml']' returned non-zero exit status 1

Is there a way to convert or create a copy of the partition-igh.yaml? I saw there was a closed issue that led to the created --airr-output argument, but I wasn't sure if there was a way to do the conversion after a partition was run.

psathyrella commented 3 years ago

hmm, that really shouldn't be possible now if --paired-loci is set. Could you run with --print-git-commit to make sure you picked up the most recent version?

elijgarcia commented 3 years ago

I am still getting the same Exception error, and I got the following from --print-git-commit:

commit: 625898dbbc9f96398954e10f236ef24f5f4e78a8
     tag: 0.16.0  (well, 397 commits ahead of)

That is on my personal computer, where I can pull new docker images quite easily. The high performance computing core at the institute I work at recently updated to partis/0.16.0, and I believe they are on commit 376. I did ask them to change that one line of code, though, and I am still getting the error.

psathyrella commented 3 years ago

hmmm that's super weird. That is the correct commit hash, but the exception is coming from the old code -- in the trace above it's at line 250, which is where it was in the last docker image, but in the code from that commit hash it's at line 255.

elijgarcia commented 3 years ago

Oh sorry for the confusion, the error stdout above was from the HPC node that only had that line changed at 250. But when I had the latest commit on the docker container, it referenced line 255

psathyrella commented 3 years ago

whoops, sorry, you must be auto parameter caching (i.e. not running a separate cache-parameters step first), I forgot to check that possibility. This should do it. It should finish building on docker hub in a half hour or so.

elijgarcia commented 3 years ago

It's working great on my end now, thank you for the speedy fix! Is there a benefit to running the cache-parameters/annotation/partition steps individually? From the user standpoint it might be easier to diagnose an issue (although your stdout when there is an error is generally very helpful!), but I'm wondering what your opinion/logic on it is.

psathyrella commented 3 years ago

Great!

There might be some useful thoughts here. But mostly the reason I almost always run a separate cache-parameters step is that it's safer, particularly in the context of production/real data. If I run them separately, especially with --refuse-to-cache-parameters set for partitioning, then I can be sure that the right options were used for parameter caching, that it was run on all sequences, and that the parameters went where I expect them to. For instance I'm usually running several different flavors of partitioning (different seed sequences, different random subsamples, different stopping criteria) on the same cached parameters. Or for instance if you change the sequences in the input file without changing its name, things will be completely wrong if run as one step (since it'll use the old parameters), but fine if you cache parameters separately.

If you're just running once without setting any special command line args, running as one step is fine, but if you're doing more complicated things it's probably safer to do two steps.
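A minimal sketch of the two-step pattern, written as a small Python 3 helper that only builds the two command lines (it doesn't run anything). The helper name is hypothetical, and it assumes that cache-parameters accepts the same --infname/--paired-loci/--paired-outdir options as partition, so check that against the paired-loci docs before relying on it.

```python
def two_step_commands(infname, paired_outdir, extra_partition_args=()):
    """Return (cache_cmd, partition_cmd) as argument lists.

    Hypothetical helper: assumes both actions take the same input/output
    options, and that parameters are written under paired_outdir.
    """
    common = ['--infname', infname, '--paired-loci', '--paired-outdir', paired_outdir]
    # Step 1: cache parameters once, on all sequences.
    cache_cmd = ['partis', 'cache-parameters'] + common
    # Step 2: partition, refusing to silently re-cache if the parameters are missing.
    partition_cmd = (['partis', 'partition'] + common
                     + ['--refuse-to-cache-parameters'] + list(extra_partition_args))
    return cache_cmd, partition_cmd


if __name__ == '__main__':
    # Placeholder paths, only for printing the resulting commands.
    for cmd in two_step_commands('seqs.fa', 'paired-out', ['--airr-output']):
        print(' '.join(cmd))
```

The point of keeping step 2 separate is that you can call it several times with different extra arguments while reusing the parameters from step 1.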

elijgarcia commented 3 years ago

I see, thank you for your insight!

yyw-informatics commented 2 years ago

Hi, I have a quick question on the same topic - AIRR output with paired-loci. The options --paired-loci and --airr-output work very well for my run, but the output tsv file is saved under the folder "single-chain". Could you clarify which file contains the results of the paired chain clonal type partition? Many thanks :-)

psathyrella commented 2 years ago

Each airr output tsv corresponds to the regular partition yaml next to which it appears -- i.e. the airr tsvs in the single-chain/ dir are for single chain partitions, while the joint/paired partitions are in the main output dir: https://github.com/psathyrella/partis/blob/main/docs/paired-loci.md#output-directory.

e.g. this

./bin/partis partition --paired-indir test/paired/ref-results/test/simu --parameter-dir test/paired/ref-results/test/parameters/simu --paired-outdir _output/tmp-pair --paired-loci --airr-output

gives this dir structure:

[thneed] partis/ > find _output/tmp-pair -type f
_output/tmp-pair/partition-igh.yaml
_output/tmp-pair/single-chain/partition-igl.tsv
_output/tmp-pair/single-chain/partition-igk.yaml
_output/tmp-pair/single-chain/partition-igh.yaml
_output/tmp-pair/single-chain/partition-igk.tsv
_output/tmp-pair/single-chain/partition-igl.yaml
_output/tmp-pair/single-chain/partition-igh.tsv
_output/tmp-pair/igh+igk/partition-igk.yaml
_output/tmp-pair/igh+igk/partition-igh.yaml
_output/tmp-pair/igh+igk/partition-igk.tsv
_output/tmp-pair/igh+igk/partition-igh.tsv
_output/tmp-pair/partition-igh.tsv
_output/tmp-pair/igh+igl/partition-igl.tsv
_output/tmp-pair/igh+igl/partition-igh.yaml
_output/tmp-pair/igh+igl/partition-igl.yaml
_output/tmp-pair/igh+igl/partition-igh.tsv
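
If it helps, here is a small sketch that sorts the AIRR tsvs in a --paired-outdir by where they sit, following the layout above. The function name and the group labels are made up for illustration. It only looks at file locations, and it lumps the igh+igk/ and igh+igl/ subdirs together as "other" without interpreting them.

```python
import os


def classify_airr_tsvs(paired_outdir):
    """Group partition-*.tsv files by location within the paired output dir.

    Hypothetical helper: 'joint' is the main output dir, 'single-chain' is the
    single-chain/ subdir, and 'other' is any other subdir (e.g. igh+igk/).
    """
    groups = {'joint': [], 'single-chain': [], 'other': []}
    for dirpath, _, filenames in os.walk(paired_outdir):
        rel = os.path.relpath(dirpath, paired_outdir)
        for fname in filenames:
            if not (fname.startswith('partition-') and fname.endswith('.tsv')):
                continue
            if rel == '.':
                groups['joint'].append(fname)
            elif rel == 'single-chain':
                groups['single-chain'].append(os.path.join(rel, fname))
            else:
                groups['other'].append(os.path.join(rel, fname))
    # os.walk order is not guaranteed, so sort for stable output.
    for paths in groups.values():
        paths.sort()
    return groups
```

Calling `classify_airr_tsvs('_output/tmp-pair')` on the dir above would put partition-igh.tsv under 'joint' and the three single-chain/ tsvs under 'single-chain'.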

yyw-informatics commented 2 years ago

Thanks for the rapid response!! Looking at my results, I have the igh+igk/l folders, but I don't have any .yaml or .tsv files in these folders. I only have .fa files in the paired chain folders. That's why I was confused about the results.

yyw-informatics commented 2 years ago

Perhaps I made mistakes here? Please see the following:

bin/partis partition --infname /mydata/data/$SAM/filtered_contig.fasta \
  --paired-loci \
  --airr-output \
  --paired-outdir /mydata/results/$SAM \
  --plotdir /mydata/figs/$SAM \
  --get-selection-metrics

Thank you so much for your help.

psathyrella commented 2 years ago

can you paste the full std out?