matsengrp / linearham

A Bayesian Phylo-HMM for B cell receptor sequence analysis
http://matsengrp.github.io/linearham
6 stars 4 forks source link

Docker image missing mafft #55

Closed wfs-lilly closed 4 years ago

wfs-lilly commented 5 years ago

Trying to using your image at Dockerhub but I reach a point where mafft is called and it isn't part of the build.

(I'd also note that your build file doesn't seem to pull in the revbayes package as part of the build, but perhaps you address that manually since it's so big.)

eharkins commented 5 years ago

Hi @wfs-lilly, would you please show how you are using the image and how you are running Linearham inside the container so we can have more context for how mafft gets called (presumably from the partis submodule depending on how you ran)?

I'm not sure what you mean by

your build file doesn't seem to pull in the revbayes package as part of the build

since revbayes is installed in the container.

wfs-lilly commented 5 years ago

Hi,

Thanks for the quick response!

I’m trying to use the system on a file of immunized mouse sequences in FASTA format, already stitched. I’m following the instructions on the github site to –run-partis. I used the command exactly as it appears, with no parameter dir.

I don’t have the stack trace around, but I see that there are calls to mafft in partis/python/glutils.py and partis/python/utils.py – I’m pretty sure it was the utils.py call because I remember seeing subprocess.check_call in the error listing.

The container does have revbayes – I just had trouble when I tried to rebuild it with the mafft dependency added. This problem was due to my ignorance of how you link repos in git, so I had not cloned recursively and actually didn’t have any of those dependencies. Sorry for the false alarm there.

Bill

William F. Smith Bioinformatician BCforward Lilly Biotechnology Center 10290 Campus Point Dr. San Diego, CA 92121 smith_william1@network.lilly.commailto:smith_william1@network.lilly.com

CONFIDENTIALITY NOTICE: This email message (including all attachments) is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure, copying or distribution is strictly prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

From: Elias Harkins notifications@github.com Sent: Thursday, November 7, 2019 10:49 AM To: matsengrp/linearham linearham@noreply.github.com Cc: William Smith - Network smith_william1@network.lilly.com; Mention mention@noreply.github.com Subject: [EXTERNAL] Re: [matsengrp/linearham] Docker image missing mafft (#55)

EXTERNAL EMAIL: Use caution before replying, clicking links, and opening attachments.

Hi @wfs-lillyhttps://github.com/wfs-lilly, would you please show how you are using the image and how you are running Linearham inside the container so we can have more context for how mafft gets called (presumably from the partis submodule depending on how you ran)?

I'm not sure what you mean by

your build file doesn't seem to pull in the revbayes package as part of the build

since revbayes is installed in the container.https://github.com/matsengrp/linearham/blob/master/Dockerfile#L38

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/matsengrp/linearham/issues/55?email_source=notifications&email_token=ANU5HJRDEV5QCY2APRMAAXLQSRPKBA5CNFSM4JKCJEQKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDNNMCY#issuecomment-551212555, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANU5HJSZEDMWGO7OK6CY5D3QSRPKBANCNFSM4JKCJEQA.

wfs-lilly commented 5 years ago

Hi,

Another note, if I may. Now that I’m trying to clone the repo with submodules it fails because the link to tclap seems to be broken. It fails to be cloned and then all other submodule code is deleted (I guess to keep everything in a transaction, though git is something of a mystery to me).

Bill

William F. Smith Bioinformatician BCforward Lilly Biotechnology Center 10290 Campus Point Dr. San Diego, CA 92121 smith_william1@network.lilly.commailto:smith_william1@network.lilly.com

CONFIDENTIALITY NOTICE: This email message (including all attachments) is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure, copying or distribution is strictly prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

From: Elias Harkins notifications@github.com Sent: Thursday, November 7, 2019 10:49 AM To: matsengrp/linearham linearham@noreply.github.com Cc: William Smith - Network smith_william1@network.lilly.com; Mention mention@noreply.github.com Subject: [EXTERNAL] Re: [matsengrp/linearham] Docker image missing mafft (#55)

EXTERNAL EMAIL: Use caution before replying, clicking links, and opening attachments.

Hi @wfs-lillyhttps://github.com/wfs-lilly, would you please show how you are using the image and how you are running Linearham inside the container so we can have more context for how mafft gets called (presumably from the partis submodule depending on how you ran)?

I'm not sure what you mean by

your build file doesn't seem to pull in the revbayes package as part of the build

since revbayes is installed in the container.https://github.com/matsengrp/linearham/blob/master/Dockerfile#L38

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/matsengrp/linearham/issues/55?email_source=notifications&email_token=ANU5HJRDEV5QCY2APRMAAXLQSRPKBA5CNFSM4JKCJEQKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDNNMCY#issuecomment-551212555, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANU5HJSZEDMWGO7OK6CY5D3QSRPKBANCNFSM4JKCJEQA.

eharkins commented 5 years ago

@wfs-lilly It's possible that your case is one that we had not encountered and so we may need to install mafft in the container. In the meantime, can you share the command you ran for now and the stack trace when you have access to it?

Regarding submodules, you shouldn't need to clone linearham at all to run inside the container, since the source code is in the container (at /linearham). However, it should be possible to clone using --recurse-submodules and get all the necessary submodules (if you wanted to make changes to the source code or for some other reason).

wfs-lilly commented 5 years ago

Hi,

I found that I did record the error output. Here it is. Note that I’m using singularity rather than docker, though I don’t think that should matter to this issue.

Singularity linearham_latest.sif:~/linearham> scons --run-partis --all-clonal-seqs --fasta-path=/home/v5x3838/v5x3838/sequences/TSPN-Bulk-VH-M6_S1_L001_RM_001.Annot.seq.fasta --num-cores=10 --outdir=/home/v5x3838/linearham/ scons: Reading SConscript files ... scons: done reading SConscript files. scons: Building targets ... lib/partis/bin/partis annotate --all-seqs-simultaneous --n-procs 8 --infname /home/v5x3838/v5x3838/sequences/TSPN-Bulk-VH-M6_S1_L001_RM_001.Annot.seq.fasta --parameter-dir /home/v5x3838/linearham/parameter_dir --locus igh --extra-annotation-columns linearham-info --outfname partis_run.yaml > partis_run.stdout.log /bin/sh: 1: mafft: not found Traceback (most recent call last): File "lib/partis/bin/partis", line 471, in args.func(args) File "lib/partis/bin/partis", line 217, in run_partitiondriver parter.run(actions) File "/home/v5x3838/linearham/lib/partis/python/partitiondriver.py", line 120, in run self.action_fcns[tmpaction]() File "/home/v5x3838/linearham/lib/partis/python/partitiondriver.py", line 276, in cache_parameters glutils.add_new_alleles(self.glfo, new_allele_info, debug=True, simglfo=self.simglfo, use_template_for_codon_info=False) # stuff is handled in (also note, can't use template for codon info since we may have already removed it) File "/home/v5x3838/linearham/lib/partis/python/glutils.py", line 894, in add_new_alleles add_new_allele(glfo, newfo, remove_template_genes=remove_template_genes, use_template_for_codon_info=use_template_for_codon_info, simglfo=simglfo, debug=debug) File "/home/v5x3838/linearham/lib/partis/python/glutils.py", line 957, in add_new_allele aligned_new_seq, aligned_template_seq = utils.color_mutants(glfo['seqs'][region][newfo['template-gene']], newfo['seq'], align=True, return_ref=True) File "/home/v5x3838/linearham/lib/partis/python/utils.py", line 991, in color_mutants ref_seq, seq = align_seqs(ref_seq, seq) File "/home/v5x3838/linearham/lib/partis/python/utils.py", line 932, in align_seqs subprocess.check_call('mafft --quiet %s >%s' % (fin.name, fout.name), shell=True) File "/usr/lib/python2.7/subprocess.py", line 186, in check_call raise CalledProcessError(retcode, cmd) CalledProcessError: Command 'mafft --quiet /tmp/tmpYKgtjr >/tmp/tmpFvuwAj' returned non-zero exit status 127 scons: *** [partis_run.yaml] Error 1 scons: building terminated because of errors.

William F. Smith Bioinformatician BCforward Lilly Biotechnology Center 10290 Campus Point Dr. San Diego, CA 92121 smith_william1@network.lilly.commailto:smith_william1@network.lilly.com

CONFIDENTIALITY NOTICE: This email message (including all attachments) is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure, copying or distribution is strictly prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

From: Elias Harkins notifications@github.com Sent: Thursday, November 7, 2019 1:26 PM To: matsengrp/linearham linearham@noreply.github.com Cc: William Smith - Network smith_william1@network.lilly.com; Mention mention@noreply.github.com Subject: [EXTERNAL] Re: [matsengrp/linearham] Docker image missing mafft (#55)

EXTERNAL EMAIL: Use caution before replying, clicking links, and opening attachments.

@wfs-lillyhttps://github.com/wfs-lilly It's possible that your case is one that we had not encountered and so we may need to install mafft in the container. In the meantime, can you share the command you ran for now and the stack trace when you have access to it?

Regarding submodules, you shouldn't need to clone linearham at all to run inside the container, since the source code is in the container (at /linearham). However, it should be possible to clone using --recurse-submodules and get all the necessary submodules (if you wanted to make changes to the source code or for some other reason).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/matsengrp/linearham/issues/55?email_source=notifications&email_token=ANU5HJQ47KMBEIRCS6K6AG3QSSBVHA5CNFSM4JKCJEQKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDN3YQI#issuecomment-551271489, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANU5HJQTP2ORZAM4SXRNLF3QSSBVHANCNFSM4JKCJEQA.

wfs-lilly commented 5 years ago

Oh – and the reason I cloned was that I was trying to rebuild a container with the mafft dependency installed.

William F. Smith Bioinformatician BCforward Lilly Biotechnology Center 10290 Campus Point Dr. San Diego, CA 92121 smith_william1@network.lilly.commailto:smith_william1@network.lilly.com

CONFIDENTIALITY NOTICE: This email message (including all attachments) is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure, copying or distribution is strictly prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

From: Elias Harkins notifications@github.com Sent: Thursday, November 7, 2019 1:26 PM To: matsengrp/linearham linearham@noreply.github.com Cc: William Smith - Network smith_william1@network.lilly.com; Mention mention@noreply.github.com Subject: [EXTERNAL] Re: [matsengrp/linearham] Docker image missing mafft (#55)

EXTERNAL EMAIL: Use caution before replying, clicking links, and opening attachments.

@wfs-lillyhttps://github.com/wfs-lilly It's possible that your case is one that we had not encountered and so we may need to install mafft in the container. In the meantime, can you share the command you ran for now and the stack trace when you have access to it?

Regarding submodules, you shouldn't need to clone linearham at all to run inside the container, since the source code is in the container (at /linearham). However, it should be possible to clone using --recurse-submodules and get all the necessary submodules (if you wanted to make changes to the source code or for some other reason).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/matsengrp/linearham/issues/55?email_source=notifications&email_token=ANU5HJQ47KMBEIRCS6K6AG3QSSBVHA5CNFSM4JKCJEQKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDN3YQI#issuecomment-551271489, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANU5HJQTP2ORZAM4SXRNLF3QSSBVHANCNFSM4JKCJEQA.

matsen commented 5 years ago

Hello there William!

Thanks for trying out our software and sorry it didn't work right off the bat. I think you are our first outside user!

We will add mafft to the container. However, the reason linearham/partis used mafft in the first place is because it realized that you weren't using the default germline database (which is human). We will add flags to the scons to allow you to specify that you have mouse sequences.

May I ask which flavor of mouse these sequences are from?

Erick

psathyrella commented 5 years ago

Hm, it looks to me like they were using the default initial database, but because the input sequences weren't human the germline inference step inferred lots of new alleles, which required mafft to align to print and get cyst/tryp info? In either case, yes, it's very important that the --species flag is set correctly in partis. I have not come up with a good automated way of warning people about this, but typically the results are crappy enough that I can't imagine one wouldn't notice.

wfs-lilly commented 5 years ago

Hi Duncan, Erick –

Thanks for being so responsive. This system seems like a perfect fit to a problem I’m trying to help our labs with, so it’s great to have found it.

The plot thickens a bit, however: the inputs are mouse, but humanized mouse, which we call Aliva (from a local company). We have a set of germline sequences we can use, but usually my first step is to align against human.

Thanks, Bill

William F. Smith Bioinformatician BCforward Lilly Biotechnology Center 10290 Campus Point Dr. San Diego, CA 92121 smith_william1@network.lilly.commailto:smith_william1@network.lilly.com

CONFIDENTIALITY NOTICE: This email message (including all attachments) is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure, copying or distribution is strictly prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

From: Duncan Ralph notifications@github.com Sent: Friday, November 8, 2019 10:06 AM To: matsengrp/linearham linearham@noreply.github.com Cc: William Smith - Network smith_william1@network.lilly.com; Mention mention@noreply.github.com Subject: [EXTERNAL] Re: [matsengrp/linearham] Docker image missing mafft (#55)

EXTERNAL EMAIL: Use caution before replying, clicking links, and opening attachments.

Hm, it looks to me like they were using the default initial database, but because the sequences weren't human the germline inference step inferred lots of new alleles, which required mafft to align to print and get cyst/tryp info? In either case, yes, it's very important that the --species flag is set correctly in partis. I have not come up with a good automated way of warning people about this, but typically the results are crappy enough that I can't imagine one wouldn't notice.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/matsengrp/linearham/issues/55?email_source=notifications&email_token=ANU5HJS6ISU3MQZYM336NWLQSWS7NA5CNFSM4JKCJEQKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDS44GI#issuecomment-551931417, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANU5HJRX6C3BNVLDUOQRCCDQSWS7NANCNFSM4JKCJEQA.

matsen commented 5 years ago

OK. The results will certainly be better if you can input the germline sequences, which are precisely known in this case.

Are the germline sequences proprietary?

On Fri, Nov 8, 2019 at 10:14 AM wfs-lilly notifications@github.com wrote:

Hi Duncan, Erick –

Thanks for being so responsive. This system seems like a perfect fit to a problem I’m trying to help our labs with, so it’s great to have found it.

The plot thickens a bit, however: the inputs are mouse, but humanized mouse, which we call Aliva (from a local company). We have a set of germline sequences we can use, but usually my first step is to align against human.

Thanks, Bill

William F. Smith Bioinformatician BCforward Lilly Biotechnology Center 10290 Campus Point Dr. San Diego, CA 92121 smith_william1@network.lilly.commailto:smith_william1@network.lilly.com

CONFIDENTIALITY NOTICE: This email message (including all attachments) is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure, copying or distribution is strictly prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

From: Duncan Ralph notifications@github.com Sent: Friday, November 8, 2019 10:06 AM To: matsengrp/linearham linearham@noreply.github.com Cc: William Smith - Network smith_william1@network.lilly.com; Mention < mention@noreply.github.com> Subject: [EXTERNAL] Re: [matsengrp/linearham] Docker image missing mafft (#55)

EXTERNAL EMAIL: Use caution before replying, clicking links, and opening attachments.

Hm, it looks to me like they were using the default initial database, but because the sequences weren't human the germline inference step inferred lots of new alleles, which required mafft to align to print and get cyst/tryp info? In either case, yes, it's very important that the --species flag is set correctly in partis. I have not come up with a good automated way of warning people about this, but typically the results are crappy enough that I can't imagine one wouldn't notice.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub< https://github.com/matsengrp/linearham/issues/55?email_source=notifications&email_token=ANU5HJS6ISU3MQZYM336NWLQSWS7NA5CNFSM4JKCJEQKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDS44GI#issuecomment-551931417>, or unsubscribe< https://github.com/notifications/unsubscribe-auth/ANU5HJRX6C3BNVLDUOQRCCDQSWS7NANCNFSM4JKCJEQA>.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/matsengrp/linearham/issues/55?email_source=notifications&email_token=AAA3QRBOUGARKVWOJMDK2GLQSWT67A5CNFSM4JKCJEQKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDS5R5I#issuecomment-551934197, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA3QRHQTTL4LL4CKT55MZDQSWT67ANCNFSM4JKCJEQA .

-- Frederick "Erick" Matsen, Associate Member Fred Hutchinson Cancer Research Center http://matsen.fredhutch.org/

psathyrella commented 5 years ago

Ah, yeah, ok that's different. In that case --species human is probably the right choice, although there's only two main differences caused by the species flag 1) which initial germline dir gets pointed to (in data/germlines/) and 2) whether clustering-based germline inference is turned on or not. If you know the precise germline genes in your mice, I think it'd be quite important to both input your own germline dir, and also turn on --leave-default-germline so that both allele-removal and new-allele-inference are turned off.

These are all partis options, so it's feeling like maybe this is a use case where partis should be run directly, rather than through linearham's --run-partis.

It still seems like mafft would be worth including in linearham's docker file, though. It's in the partis one.

wfs-lilly commented 5 years ago

So then for right now (and, of course, my results were due last week) I should run partis separately using the appropriate germlines and then feed that output into linearham?

Bill

William F. Smith Bioinformatician BCforward Lilly Biotechnology Center 10290 Campus Point Dr. San Diego, CA 92121 smith_william1@network.lilly.commailto:smith_william1@network.lilly.com

CONFIDENTIALITY NOTICE: This email message (including all attachments) is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure, copying or distribution is strictly prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

From: Duncan Ralph notifications@github.com Sent: Friday, November 8, 2019 10:37 AM To: matsengrp/linearham linearham@noreply.github.com Cc: William Smith - Network smith_william1@network.lilly.com; Mention mention@noreply.github.com Subject: [EXTERNAL] Re: [matsengrp/linearham] Docker image missing mafft (#55)

EXTERNAL EMAIL: Use caution before replying, clicking links, and opening attachments.

Ah, yeah, ok that's different. In that case --species human is probably the right choice, although there's only two main differences caused by the species flag 1) which initial germline dir gets pointed to (in data/germlines/) and 2) whether clustering-based germline inference is turned onhttps://github.com/psathyrella/partis/blob/master/docs/subcommands.md#germline-sets or not. If you know the precise germline genes in your mice, I think it'd be quite important to both input your own germline dir, and also turn on --leave-default-germline so that both allele-removal and new-allele-inference are turned off.

These are all partis options, so it's feeling like maybe this is a use case where partis should be run directly, rather than through linearham's --run-partis.

It still seems like mafft would be worth including in linearham's docker file, though. It's in the partis one.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/matsengrp/linearham/issues/55?email_source=notifications&email_token=ANU5HJVBRNHA2JOCWOOVZYTQSWWV5A5CNFSM4JKCJEQKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDS7QJA#issuecomment-551942180, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANU5HJRVX3XG6JGLVPHERS3QSWWV5ANCNFSM4JKCJEQA.

psathyrella commented 5 years ago

Yes, that sounds right to me.

psathyrella commented 5 years ago

It would probably be a good idea to look at a few of the ascii-art annotations by hand to make sure things look good. A couple ways to do it, but running view-output on the output is probably the easiest.

wfs-lilly commented 5 years ago

Hi,

Could someone help me decode/debug the following?

I did as suggested and am running in the psathyrella/partis docker image for the initial analysis.

I’ve copied in my germlines to germlines/aliva/igh. Our names don’t match the partis requirements but that seems to be OK.

Note that I didn’t specify a species – I assumed that wouldn’t be necessary when providing the germlines.

I’ve highlighted the sections that I don’t understand – could you explain or point me to an explanation?

Also, obviously, the run crashed – help on that?

There are lots of tuning parameters – should I be using any? I’m trying to keep things as simple as possible for now.

This was running on my local Windows box under Docker Desktop. The deduplicated file has about 134,000 sequences. (Is there, BTW, any value in providing duplicates? Also, as you can see from the name, I’ve included only sequences which have at least 2 reads. Will that have adverse effects on the analysis?)

Thanks for all the help, Bill

(base) root@d2606784b59d:/host/home/mouse# /partis/bin/partis annotate --infname TSPN-Bulk-VH-M6_S1_L001_RM_001.Annot.seq.dedup2.fasta --outfname ./TSPN-m6-dedup2.yaml --locus igh --initial-germline-dir ./germlines/aliva --sanitize-input-germlines parameter dir '_output/TSPN-Bulk-VH-M6_S1_L001_RM_001.Annot.seq.dedup2' does not exist, so caching a new set of parameters before running action 'annotate' using default germline dir /partis/data/germlines/human for template glfo (e.g. for codon positions) reading igh locus glfo from ./germlines/aliva warning extra file ./germlines/aliva/igh/extras.csv not found (i.e. no codon positions for glfo) renamed 32 genes from ./germlines/aliva/igh/ighv.fasta: IGHV3-74 --> IGHV3-74x, IGHV3-73 --> IGHV3-73x, IGHV3-72 --> IGHV3-72x, IGHV2-70 --> IGHV2-70x, IGHV1-69 --> IGHV1-69x, IGHV3-66 --> IGHV3-66x, IGHV3-64 --> IGHV3-64x, IGHV4-59 --> IGHV4-59x, IGHV1-58 --> IGHV1-58x, IGHV3-53 --> IGHV3-53x [...] renamed 9 genes from ./germlines/aliva/igh/ighd.fasta: IGHD6-19 --> IGHD6-19x, IGHD1-20 --> IGHD1-20x, IGHD2-21 --> IGHD2-21x, IGHD3-22 --> IGHD3-22x, IGHD4-23 --> IGHD4-23x, IGHD5-24 --> IGHD5-24x, IGHD6-25 --> IGHD6-25x, IGHD1-26 --> IGHD1-26x, IGHD7-27 --> IGHD7-27x warning mutliple names in ./germlines/aliva/igh/ighd.fasta for the following sequences: IGHD5-501 IGHD5-1801 GTGGATACAGCTATGGTTAC it is highly recommended to remove the duplicates to avoid ambiguity renamed 6 genes from ./germlines/aliva/igh/ighj.fasta: IGHJ1 --> IGHJ1x, IGHJ2 --> IGHJ2x, IGHJ3 --> IGHJ3x, IGHJ4 --> IGHJ4x, IGHJ5 --> IGHJ5x, IGHJ6 --> IGHJ6*x missing 6 tryp positions adding gene with codon position hj201 from template glfo, to align against new genes had to add an N to the right side of template glfo's gene hj201, since it's in under a different name hj2x adding new allele to glfo: CTACTGGTACTTCGATCTCTGGGGCCGTGGCACCCTGGTCACTGTCTCCTCAGN hj201_TEMPLATE (no template) missing alignments for 7 j genes: hj2x hj5x hj3x hj201_TEMPLATE hj6x hj1x hj4x Traceback (most recent call last): File "/partis/bin/partis", line 472, in args.func(args) File "/partis/bin/partis", line 215, in run_partitiondriver input_info, reco_info, glfo, simglfo = read_inputs(args, actions) File "/partis/bin/partis", line 169, in read_inputs glfo = glutils.read_glfo(gldir, args.locus, only_genes=args.only_genes, template_glfo=template_glfo, add_dummy_name_components=args.sanitize_input_germlines, debug=2 if args.sanitize_input_germlines else False) File "/partis/python/glutils.py", line 598, in read_glfo get_missing_codon_info(glfo, template_glfo=template_glfo, remove_bad_genes=remove_bad_genes, debug=debug) File "/partis/python/glutils.py", line 386, in get_missing_codon_info aligned_seqs = get_new_alignments(glfo, region, debug=debug) File "/partis/python/glutils.py", line 325, in get_new_alignments ignore_extra_ids=True) File "/partis/python/utils.py", line 922, in align_many_seqs biggest_length = max(len(sfo['seq']) for sfo in existing_aligned_seqfos) ValueError: max() arg is an empty sequence

William F. Smith Bioinformatician BCforward Lilly Biotechnology Center 10290 Campus Point Dr. San Diego, CA 92121 smith_william1@network.lilly.commailto:smith_william1@network.lilly.com

CONFIDENTIALITY NOTICE: This email message (including all attachments) is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure, copying or distribution is strictly prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

From: Duncan Ralph notifications@github.com Sent: Friday, November 8, 2019 10:55 AM To: matsengrp/linearham linearham@noreply.github.com Cc: William Smith - Network smith_william1@network.lilly.com; Mention mention@noreply.github.com Subject: [EXTERNAL] Re: [matsengrp/linearham] Docker image missing mafft (#55)

EXTERNAL EMAIL: Use caution before replying, clicking links, and opening attachments.

Yes, that sounds right to me.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/matsengrp/linearham/issues/55?email_source=notifications&email_token=ANU5HJS72BIHK5DVHISQOJ3QSWYW7A5CNFSM4JKCJEQKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDTA53I#issuecomment-551948013, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANU5HJU57L5W5V54QRZ6HHTQSWYW7ANCNFSM4JKCJEQA.

psathyrella commented 5 years ago

Well as to the crash, it definitely shouldn't be crashing, I'll think through the code carefully to try to guess how it could end up with a zero length list there. In the meantime it might be easier if you have any way to provide the germline meta info so it doesn't need to do the alignment step. What's happening is the germline dirs normally include a csv file with codon information, and it didn't find one in the germline directory you passed. And it tries to be clever and guess which genes to use as templates and align them properly to get the cyst/tryp/phen positions, but it's not really very clever. If you are able to pass me the germline dir you're using, that would make it a lot easier to fix the underlying issue.

Our names don’t match the partis requirements but that seems to be OK.

yeah that's fine. It just changes them to match the requirements (in this case, it just needs something for the allele, so it adds 'x')

Note that I didn’t specify a species – I assumed that wouldn’t be necessary when providing the germlines.

yes, that's fine. That means it's defaulting to human.

While there certainly are an absurd number of command line arguments, very few of them are really for the purpose of tuning things -- the philosophy is definitely that users will run with default parameters the overwhelming majority of the time. The command line arguments are instead mostly for eliciting non-standard behavior, things analogous to telling it to use a non-default germline directory location.

This was running on my local Windows box under Docker Desktop. The deduplicated file has about 134,000 sequences. (Is there, BTW, any value in providing duplicates? Also, as you can see from the name, I’ve included only sequences which have at least 2 reads. Will that have adverse effects on the analysis?)

Nope, partis will immediately collapse any duplicates that it finds (although it does record number of duplicates in the output file). I can't speak to the optimal minimum number of reads per sequence, except to say that partis does zero of what we call "preprocessing" -- things such as sequencing error correction, umi handling, etc. So partis views every input sequence as an actual BCR, and any mismatch from germline as a somatic hypermutation. I.e. it assumes all preprocessing has already been performed.

psathyrella commented 5 years ago

Nevermind about sending your germline info, the crash was actually easy to replicate, and to fix it I just needed to set the offending empty list to None. Fixed it here.

wfs-lilly commented 5 years ago

linearham_aliva_failed.log partis_succeed_aliva.seq_obs.log

Hello,

I think I’ve done as you suggested, and I’ve run into a crash in linearham. Based on the error message, it appears to be a problem with my inputs, but I don’t really know what to do. May I ask for a quick look, please?

Here’s my process.

Deduplicate my FASTA file and remove all sequences with < 2 repeats. (This is because I’m running on my laptop with limited memory and just want to get the process straight. I’ll revisit this step after I get some results and spend the work to bring up the images on a bigger machine – not trivial within Lilly.)

Start the partis docker image. Copy in the fasta file and the Aliva mouse germline files into the expected locations. Execute partis (the log is attached with sequences obscured). (base) root@d2606784b59d:/host/home/mouse# /partis/bin/partis annotate --infname TSPN-Bulk-VH-M6_S1_L001_RM_001.Annot.seq.dedup2.fasta --outfname ./TSPN-m6-dedup2.yaml --locus igh --initial-germline-dir ./germlines/aliva --sanitize-input-germlines

Tar up the _output directory and copy it and the TSPN-m6-dedup2.yaml files onto my host. Start the partis/linearham docker image with a /bin/bash entrypoint Copy in the yaml and tar files from the previous steps; untar and locate them appropriately. Run linearham. (log attached) root@6fe89556c05f:/linearham# scons --run-linearham --partis-yaml-file=/home/TSPN-m6-dedup2_as_alive.yaml --parameter-dir=/home/_output/TSPN-Bulk-VH-M6_S1_L001_RM_001.Annot.seq.dedup2/

I see a few messages that indicate I should do something (duplicate germlines, e.g.), but none appear to be critical, so I don’t really know quite where to go to get the linearham analysis to complete.

Thanks, Bill

William F. Smith Bioinformatician BCforward Lilly Biotechnology Center 10290 Campus Point Dr. San Diego, CA 92121 smith_william1@network.lilly.commailto:smith_william1@network.lilly.com

CONFIDENTIALITY NOTICE: This email message (including all attachments) is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure, copying or distribution is strictly prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

From: Duncan Ralph notifications@github.com Sent: Monday, November 11, 2019 4:17 PM To: matsengrp/linearham linearham@noreply.github.com Cc: William Smith - Network smith_william1@network.lilly.com; Mention mention@noreply.github.com Subject: [EXTERNAL] Re: [matsengrp/linearham] Docker image missing mafft (#55)

EXTERNAL EMAIL: Use caution before replying, clicking links, and opening attachments.

Well as to the crash, it definitely shouldn't be crashing, I'll think through the code carefully to try to guess how it could end up with a zero length list there. In the meantime it might be easier if you have any way to provide the germline meta info so it doesn't need to do the alignment step. What's happening is the germline dirs normally include a csv file with codon information, and it didn't find one in the germline directory you passed. And it tries to be clever and guess which genes to use as templates and align them properly to get the cyst/tryp/phen positions, but it's not really very clever. If you are able to pass me the germline dir you're using, that would make it a lot easier to fix the underlying issue.

Our names don’t match the partis requirements but that seems to be OK.

yeah that's fine. It just changes them to match the requirements (in this case, it just needs something for the allele, so it adds 'x')

Note that I didn’t specify a species – I assumed that wouldn’t be necessary when providing the germlines.

yes, that's fine. That means it's defaulting to human.

While there certainly are an absurd number of command line arguments, very few of them are really for the purpose of tuning things -- the philosophy is definitely that users will run with default parameters the overwhelming majority of the time. The command line arguments are instead mostly for eliciting non-standard behavior, things analogous to telling it to use a non-default germline directory location.

This was running on my local Windows box under Docker Desktop. The deduplicated file has about 134,000 sequences. (Is there, BTW, any value in providing duplicates? Also, as you can see from the name, I’ve included only sequences which have at least 2 reads. Will that have adverse effects on the analysis?)

Nope, partis will immediately collapse any duplicates that it finds (although it does record number of duplicates in the output file). I can't speak to the optimal minimum number of reads per sequence, except to say that partis does zero of what we call "preprocessing" -- things such as sequencing error correction, umi handling, etc. So partis views every input sequence as an actual BCR, and any mismatch from germline as a somatic hypermutation. I.e. it assumes all preprocessing has already been performed.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/matsengrp/linearham/issues/55?email_source=notifications&email_token=ANU5HJRG5BY7JNJEJFHBN3LQTHYXLA5CNFSM4JKCJEQKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDYSZUY#issuecomment-552676563, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANU5HJUK44AHS2DALABWOM3QTHYXLANCNFSM4JKCJEQA.

wfs-lilly commented 5 years ago

linearham_aliva_failed.log partis_succeed_aliva.seq_obs.log

Another request please: can please tell me how to generate the “extras.csv” file? I had a look and the columns made sense, but I couldn’t match them with the sequences in your supplied human germlines. I don’t think I can share the germlines – the data are shared with a collaborator.

Thanks, Bill

William F. Smith Bioinformatician BCforward Lilly Biotechnology Center 10290 Campus Point Dr. San Diego, CA 92121 smith_william1@network.lilly.commailto:smith_william1@network.lilly.com

CONFIDENTIALITY NOTICE: This email message (including all attachments) is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure, copying or distribution is strictly prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

From: Duncan Ralph notifications@github.com Sent: Monday, November 11, 2019 4:17 PM To: matsengrp/linearham linearham@noreply.github.com Cc: William Smith - Network smith_william1@network.lilly.com; Mention mention@noreply.github.com Subject: [EXTERNAL] Re: [matsengrp/linearham] Docker image missing mafft (#55)

EXTERNAL EMAIL: Use caution before replying, clicking links, and opening attachments.

Well as to the crash, it definitely shouldn't be crashing, I'll think through the code carefully to try to guess how it could end up with a zero length list there. In the meantime it might be easier if you have any way to provide the germline meta info so it doesn't need to do the alignment step. What's happening is the germline dirs normally include a csv file with codon information, and it didn't find one in the germline directory you passed. And it tries to be clever and guess which genes to use as templates and align them properly to get the cyst/tryp/phen positions, but it's not really very clever. If you are able to pass me the germline dir you're using, that would make it a lot easier to fix the underlying issue.

Our names don’t match the partis requirements but that seems to be OK.

yeah that's fine. It just changes them to match the requirements (in this case, it just needs something for the allele, so it adds 'x')

Note that I didn’t specify a species – I assumed that wouldn’t be necessary when providing the germlines.

yes, that's fine. That means it's defaulting to human.

While there certainly are an absurd number of command line arguments, very few of them are really for the purpose of tuning things -- the philosophy is definitely that users will run with default parameters the overwhelming majority of the time. The command line arguments are instead mostly for eliciting non-standard behavior, things analogous to telling it to use a non-default germline directory location.

This was running on my local Windows box under Docker Desktop. The deduplicated file has about 134,000 sequences. (Is there, BTW, any value in providing duplicates? Also, as you can see from the name, I’ve included only sequences which have at least 2 reads. Will that have adverse effects on the analysis?)

Nope, partis will immediately collapse any duplicates that it finds (although it does record number of duplicates in the output file). I can't speak to the optimal minimum number of reads per sequence, except to say that partis does zero of what we call "preprocessing" -- things such as sequencing error correction, umi handling, etc. So partis views every input sequence as an actual BCR, and any mismatch from germline as a somatic hypermutation. I.e. it assumes all preprocessing has already been performed.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/matsengrp/linearham/issues/55?email_source=notifications&email_token=ANU5HJRG5BY7JNJEJFHBN3LQTHYXLA5CNFSM4JKCJEQKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDYSZUY#issuecomment-552676563, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANU5HJUK44AHS2DALABWOM3QTHYXLANCNFSM4JKCJEQA.

psathyrella commented 5 years ago

You don't need to make the extras.csv, I committed the fix above ^ so the automatic generation won't crash. But it's just got a line for each gene, with an integer with the zero-indexed position of the conserved codon.

I can't see any attachments, since this is a github issue, not an actual email chain. I would recommend commenting on the github web interface here rather than by email.

wfs-lilly commented 5 years ago

Thanks. I attached the logs on the github conversation. I’ll comment there from now on.

William F. Smith Bioinformatician BCforward Lilly Biotechnology Center 10290 Campus Point Dr. San Diego, CA 92121 smith_william1@network.lilly.commailto:smith_william1@network.lilly.com

CONFIDENTIALITY NOTICE: This email message (including all attachments) is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure, copying or distribution is strictly prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

From: Duncan Ralph notifications@github.com Sent: Tuesday, November 12, 2019 3:51 PM To: matsengrp/linearham linearham@noreply.github.com Cc: William Smith - Network smith_william1@network.lilly.com; Mention mention@noreply.github.com Subject: [EXTERNAL] Re: [matsengrp/linearham] Docker image missing mafft (#55)

EXTERNAL EMAIL: Use caution before replying, clicking links, and opening attachments.

You don't need to make the extras.csv, I committed the fix above ^ so the automatic generation won't crash. But it's just got a line for each gene, with an integer with the zero-indexed position of the conserved codon.

I can't see any attachments, since this is a github issue, not an actual email chain. I would recommend commenting on the github web interface herehttps://github.com/matsengrp/linearham/issues/55 rather than by email.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/matsengrp/linearham/issues/55?email_source=notifications&email_token=ANU5HJRH5A2DTGOJOBBIRODQTM6ORA5CNFSM4JKCJEQKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOED4LQ4A#issuecomment-553171056, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANU5HJQ4CHMEPOL5PTQ36ADQTM6ORANCNFSM4JKCJEQA.

psathyrella commented 5 years ago

hm, it looks like the input germline sequences are almost entirely Ns, so the results are likely just nonsense. Is that what they look like in the file you're inputting? Or is partis somehow mangling them?

  warning mutliple names in ./germlines/aliva/igh/ighd.fasta for the following sequences:
    IGHD5-5*01 IGHD5-18*01                               GNNNNNNNNNNNNNNNNNNN
  it is highly recommended to remove the duplicates to avoid ambiguity
    renamed 6 genes from ./germlines/aliva/igh/ighj.fasta: IGHJ1 --> IGHJ1*x,  IGHJ2 --> IGHJ2*x,  IGHJ3 --> IGHJ3*x,  IGHJ4 --> IGHJ4*x,  IGHJ5 --> IGHJ5*x,  IGHJ6 --> IGHJ6*x
      missing 6 tryp positions
    adding gene with codon position hj201 from template glfo, to align against new genes
    had to add an N to the right side of template glfo's gene hj201, since it's in <glfo> under a different name hj2x
    adding new allele to glfo:
               CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN   hj201_TEMPLATE (no template)
        missing alignments for 7 j genes: hj2x hj5x hj3x hj201_TEMPLATE hj6x hj1x hj4x
          added 7 new alignments
  new alignments:
            ----------CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN-   hj2x      <--- new
            ------------ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN-   hj5x      <--- new
            -------------TNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN-   hj3x      <--- new
            ----------CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN   hj201_TEMPLATE  <--- new
            ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN-   hj6x      <--- new
            -----------GNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN-   hj1x      <--- new
            ---------------ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN-   hj4x      <--- new

  using known position 19 (aligned 29) from IGHJ2*01_TEMPLATE
            ----------CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN   hj201_TEMPLATE
            ----------CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN-   hj2x           new
            ------------ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN-   hj5x           new
            -------------TNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN-   hj3x           new
            ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN-   hj6x           new
            -----------GNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN-   hj1x           new
            ---------------ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN-   hj4x           new
          7 tryp positions:   7 ok
      added 7 tryp positions
      missing 32 cyst positions
    adding gene with codon position hv1-6901 from template glfo, to align against new genes
    adding new allele to glfo:
               CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
        missing alignments for 33 v genes: hv3-33x hv3-66x hv3-23x hv3-15x hv1-2x hv3-30x hv4-4x hv3-64x hv4-34x hv3-7x hv2-70x hv3-72x hv3-73x hv3-74x hv3-21x hv1-46x hv3-20x hv2-26x hv3-43x hv3-49x hv1-45x hv3-13x
          added 33 new alignments
  new alignments:
            CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN------GNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN------TNN---TNNNNNNNNNNNNNNNNNNNNNNNNNN
            GNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN------GNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN------CNN---TNNNNN---CNNNNNNNNNNNNNNNNN
            GNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN------GNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN------TNN---TNNNNNNNNNNNNNNNNNNNNNNNNNN
            GNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN------GNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN---TNNNNNNNNNNNNNNNNNNNNNNNNNN
            CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN------CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN------TNN---CNNNNNNNNNNNNNNNNNNNNNNNNNN
            CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN------GNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN------TNN---TNNNNNNNNNNNNNNNNNNNNNNNNNN
            CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN------GNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN------CNN
wfs-lilly commented 5 years ago

Sorry -- failed to note that I obscured all of the germline sequences in the logs so that I could share the logs with you. I did that by replacing all but the initial letter with N. I hoped that the sequences themselves wouldn't be required to analyze the logs.

William F. Smith Bioinformatician BCforward Lilly Biotechnology Center 10290 Campus Point Dr. San Diego, CA 92121 smith_william1@network.lilly.commailto:smith_william1@network.lilly.com

psathyrella commented 5 years ago

Ah, right, sorry.

A couple things while I'm thinking of them: 1) if you can run partis with --n-procs it'll go a lot faster, and 2) it looks like you're running without --leave-default-germline, which I think based on our previous discussion you want? At least if you know that you have all the genes that are in the actual germline.

Also, that doesn't look like it's the entire partis output? Sorry it's so verbose, that's just because when it's reading the germline info without any meta info it tries to be super verbose about printing the alignments so you can check that they look sensible; the rest of the partis output should be informative.

As to the actual crash, I'm sorting through why I was certain there would be smith-waterman info for every sequence, and why that's failing now, but it undoubtedly boils down to that we mostly tested linearham with the --run-partis option, and running partis separately with different options lets some inconsistency slip in.

psathyrella commented 5 years ago

Also, I can't really tell exactly what's being run, but it appears that you're trying to run linearham on all 130k sequences? If so that, uh, won't work at all. Linearham is running actual real phylogenetics stuff, so it can really only be run on individual lineages, and I'm not sure what the practical maximum number of sequences is in a lineage (@eharkins might know that number), but it's far, far less than 130k. The idea is that you'd run partis on the entire 130k sample in order to infer good parameters, and get good partis annotations for every sequence, but then run linearham only on individual clusters of interest.

wfs-lilly commented 5 years ago

Thanks.

I’ll try upping the n-procs – I’ve had issues with subprocesses crashing quietly but I’ll allocate more RAM and CPU to docker.

I missed the point about leave-default-germline – thanks for showing me again.

You’re quite right about missing partis output – my sequence alteration script dropped the last part. I’ll reupload.

I’m making guesses about linearham inputs, but what you say is obvious in retrospect. How do I iterate over the clusters partis finds?

partis_succeed_aliva.seq_obs.log

psathyrella commented 5 years ago

ok -- don't laugh -- I've actually never run linearham myself, but it looks like it's the --cluster-ind option. Which I guess defaults to 0 (which would usually mean the largest cluster, since partis usually sorts by decreasing size), which honestly I would think it might make sense to make it a required argument, since at least in your case I guess you're silently getting the zeroth cluster whether you want it or not.

But in any case, the crash is happening because it's trying to get linearham info for every sequence in the input file, but (as in most samples) some small fraction of sequences failed annotation. This never happened to us, since we were always running on individual clusters were of particular interest, so had no failed sequences in them. I'll commit something to fix it presently, but might not be able to until the morning.

wfs-lilly commented 5 years ago

Thanks. No snickering here.

How do I get a list of cluster-ind values from the partis output, or do I just start at 1 and work upward until I get an error?

psathyrella commented 5 years ago

oh, wait -- also, you're running partis annotate, which doesn't make sense for input to linearham. You need clusters, annotate just gives annotations for individual sequences. See here.

Any cluster index value from 0 to the length of the partition is valid, but the idea with linearham is that you run it on a cluster in which you are particularly interested. To figure this out if you don't know which cluster, one way would be to use the partis view-output options to look at the partition output.

But I think we should back up a little, I'm not entirely certain that we're on the same page as to what you're trying to do -- the purpose of linearham is to get super accurate naive sequences on a particular cluster. But it seems like since you're running on the entire sample, maybe you're more at the stage where you just want to partition the entire sample and look at the resulting annotations, i.e. you may just want to run partis. Linearham naive sequences are more accurate, but the difference even in the worse case is only a couple bases, and this is only for very imbalanced trees, i.e. it's only a difference you'd care about if you're about to go and synthesize a naive precursor.

Anyway, I also just pushed a fix to avoid the crash you were getting, since either way that shouldn't happen. But docker hub seems to be down now (I'm getting a 404), so until that's resolved you'd have to pull from github and build the docker image yourself if you were so inclined.

wfs-lilly commented 5 years ago

The reason I’m running annotate is that I looked at SConstruct and saw this:

    partis_mode = "annotate --all-seqs-simultaneous" if options["all_clonal_seqs"] else "partition"

And then in the linearham README there’s this for the –run-partis command:

--all-clonal-seqs Should the repertoire sequences be treated as clonal?

It’s very possible I’ve misunderstood. The FASTA file has BCR sequences from an immunized mouse. I don’t know how many of the sequences are naïve, but I want to figure that out and then build lineage trees of all of them and their descendants. We’re especially interested in the maturity level of each sequence with respect to the tree to which it is assigned. Our initial read of the linearham preprint seemed to indicate that it could be a toolset to do these analyses.

Please let me know how I should use the tools for these purposes.

Thanks, Bill

psathyrella commented 5 years ago

Yeah, what that command is doing is using annotate with --all-seqs-simultaneous if it was told on the command line that all seqs were clonal, otherwise it runs partition. Since it sounds like you are running on the whole mouse repertoire, which would be many clones, you'd need to run partition.

Yes, it sounds like you should be using plain partis at the moment, not linearham. Partis does not build a tree for each family by default, since that is not as of 2019 a computationally feasible thing to do on BCR NGS data, but if you just want to know how mutated each sequence is compared to its family, that is part of the default partis output and doesn't require a tree. If you want a quick and dirty tree you can use get-tree-metrics, which I'd recommend running after partitioning, as a separate step, and maybe not on every cluster.

wfs-lilly commented 5 years ago

Thanks – I think I’m beginning to understand.

If I still want to do the in depth lineage analysis of each cluster (or a reasonable subset), could I do the following?

  1. Run partis partition.
  2. Review clusters for basically making sense (get-tree-metrics and/or view-output)
  3. For each cluster that I’m interested in
    • Create separate FASTA of only sequences in the cluster of interest.
    • Run the partis annotate / linearham pipeline

What will happen if I run linearham against the partis partition outputs and specify a cluster-ind? Is that a sensible thing to do?

psathyrella commented 5 years ago

Yes, that sounds like a good way to do it.

That way should also work, without the extra step of creating the fasta file for input to linearham. For instance running view-output on one of the test outputs:

./bin/partis view-output --outfname test/reference-results/partition-new-simu.yaml --abbreviate |less -RS p

the index should correspond to the order in which the clusters appear in the best partition (at the top -- below, the annotations, below, in general appear in many partitions, so their order is not necessarily meaningful). But it's probably easier to access the clusters you want with a couple lines of python, like in this example.