uclahs-cds / package-moPepGen

Multi-Omics Peptide Generator
https://uclahs-cds.github.io/package-moPepGen/
GNU General Public License v2.0

Peptide Table saved by callVariant #856

Closed zhuchcn closed 8 months ago

zhuchcn commented 8 months ago

Description

As described in #833, an additional table is now saved by callVariant, with detailed annotation of each segment of the peptide, including where each segment maps to the original transcript and whether it carries a variant. The table looks like this:

#seq                   label                                        start  end  feature_type  feature_id         ref_start  ref_end  start_offset  end_offset  variant
IQGARATLUF            ENST00000614167.2|SNV-1179-G-T|1             0      5    transcript    ENST00000614167.2  306        311      2             -2          .
IQGARATLUF            ENST00000614167.2|SNV-1179-G-T|1             5      10   transcript    ENST00000614167.2  311        316      2             -2          .
RIQGARATLUF           ENST00000614167.2|SNV-1179-G-T|2             0      1    transcript    ENST00000614167.2  305        306      2             -2          .
RIQGARATLUF           ENST00000614167.2|SNV-1179-G-T|2             1      6    transcript    ENST00000614167.2  306        311      2             -2          .
RIQGARATLUF           ENST00000614167.2|SNV-1179-G-T|2             6      11   transcript    ENST00000614167.2  311        316      2             -2          .
NTLSGMQKFMGEDUNFHERK  ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|1  0      2    transcript    ENST00000614167.2  201        203      2             -2          .
NTLSGMQKFMGEDUNFHERK  ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|1  1      3    .             .                  .          .        2             -2          MNV-610-GA-AT
NTLSGMQKFMGEDUNFHERK  ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|1  2      8    transcript    ENST00000614167.2  203        209      2             -2          .
NTLSGMQKFMGEDUNFHERK  ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|1  8      19   transcript    ENST00000614167.2  209        220      2             -2          .

This brings another benefit. Because the peptide information is now bigger than just the sequence and FASTA header, I have to save it to disk instead of holding it in memory until the end. And since this table also contains the FASTA header, the header doesn't need to stay in memory either. With this, only a dict of sequences and pointers into the table is kept in memory; at the end, the information can be loaded quickly and converted into FASTA entries to save. So basically this should reduce memory usage, too.

Closes #833
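The pointer-based bookkeeping described above can be sketched roughly like this (hypothetical names, with an in-memory StringIO standing in for the on-disk table; this is not the actual moPepGen code):

```python
import io

table = io.StringIO()   # stands in for the peptide table on disk
pointers = {}           # sequence -> file offsets of its table rows

def save_row(seq, header):
    # Write the row to the table and remember only where it starts.
    offset = table.tell()
    table.write(f"{seq}\t{header}\n")
    pointers.setdefault(seq, []).append(offset)

save_row("IQGARATLUF", "ENST00000614167.2|SNV-1179-G-T|1")
save_row("RIQGARATLUF", "ENST00000614167.2|SNV-1179-G-T|2")

def to_fasta(seq):
    # At the end, seek back to each saved row and rebuild FASTA entries.
    entries = []
    for offset in pointers[seq]:
        table.seek(offset)
        _, header = table.readline().rstrip("\n").split("\t")
        entries.append(f">{header}\n{seq}")
    return entries
```

Only the dict of sequences and integer offsets stays resident; everything else round-trips through the table file.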


lydiayliu commented 8 months ago

Sorry, do you have a description of each of the columns? I need to understand them first XD. Edit: nvm, found it here: https://github.com/uclahs-cds/package-moPepGen/issues/833#issuecomment-1974291789

For the first two columns, I suggest going with sequence and header, or something else from FASTA terminology, for consistency.

lydiayliu commented 8 months ago

This is really cool! From a pure user perspective, I'm just wondering why it would be valuable for me to know that this peptide came from 2 nodes?

IQGARATLUF            ENST00000614167.2|SNV-1179-G-T|1             0      5    transcript    ENST00000614167.2  306        311      2             -2          .
IQGARATLUF            ENST00000614167.2|SNV-1179-G-T|1             5      10   transcript    ENST00000614167.2  311        316      2             -2          .

As is, what information is lost if the above two entries are grouped together?

Also, idk how useful this level of detail is to the average user. Would it be possible to add a verbosity flag to this? off / concise / detail or something?

In addition to the case above, for concise I feel like the following could be summarized into 2 rows, one for the transcript and one for the MNV? I don't like how the MNV breaks up the transcript, and yet the peptide coordinates overlap between the transcript and the variant (obviously they should, but it looks weird).

NTLSGMQKFMGEDUNFHERK  ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|1  0      2    transcript    ENST00000614167.2  201        203      2             -2          .
NTLSGMQKFMGEDUNFHERK  ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|1  1      3    .             .                  .          .        2             -2          MNV-610-GA-AT
NTLSGMQKFMGEDUNFHERK  ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|1  2      8    transcript    ENST00000614167.2  203        209      2             -2          .
NTLSGMQKFMGEDUNFHERK  ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|1  8      19   transcript    ENST00000614167.2  209        220      2             -2          .
zhuchcn commented 8 months ago

I changed the column names to sequence and header. I've made some changes since I opened the PR. You are right that users don't need to know that a sequence comes from two separate nodes, and having separate segments in the file also increases its size, so I did my best to merge adjacent segments. This is what it looks like right now. You can see the first peptide has three segments: an insertion in the middle and two reference sequences on either side. I won't add an additional parameter to control this behavior, because I don't think keeping the unmerged adjacent segments adds any useful information.

#sequence                  header                                                start  end  feature_type  feature_id         ref_start  ref_end  start_offset  end_offset  variant
EATEKARHETLCC              ENST00000614167.2|INDEL-399-A-AC|1                    0      9    transcript    ENST00000614167.2  371        398      0             0           .
EATEKARHETLCC              ENST00000614167.2|INDEL-399-A-AC|1                    9      10   .             .                  .          .        0             1           INDEL-399-A-AC
EATEKARHETLCC              ENST00000614167.2|INDEL-399-A-AC|1                    9      13   transcript    ENST00000614167.2  397        409      2             0           .
EATEKARHE                  ENST00000614167.2|MNV-399-ACC-TAA|1                   0      9    transcript    ENST00000614167.2  371        398      0             0           .
ARHETLCC                   ENST00000614167.2|INDEL-399-A-AC|2                    0      4    transcript    ENST00000614167.2  386        398      0             0           .
ARHETLCC                   ENST00000614167.2|INDEL-399-A-AC|2                    4      5    .             .                  .          .        0             1           INDEL-399-A-AC
ARHETLCC                   ENST00000614167.2|INDEL-399-A-AC|2                    4      8    transcript    ENST00000614167.2  397        409      2             0           .
NTLSGMQKFMGEDUNFHER        ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|4           0      2    transcript    ENST00000614167.2  605        611      0             1           .
NTLSGMQKFMGEDUNFHER        ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|4           1      3    .             .                  .          .        2             2           MNV-610-GA-AT
NTLSGMQKFMGEDUNFHER        ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|4           2      19   transcript    ENST00000614167.2  611        662      1             0           .

I'll add this to the file format page. I think the most confusing part is start_offset and end_offset. Basically, ref_start is the first nucleotide of the codon that the first amino acid matches, and ref_end is the end of the codon that the last amino acid matches. But sometimes an amino acid can harbor a variant. In the example below, YWP is matched to the nucleotide sequence TTGGCC (108-114), but the corresponding codons span 106-115, so there are start and end offsets: 106 + 2 = 108, and 115 - 1 = 114. I would like the difference between ref_start and ref_end to always be three times the amino acid length.

(image: codon alignment diagram from #833)
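As I read the worked example above, the offset arithmetic amounts to the following (a hypothetical helper, assuming start_offset is added and end_offset is subtracted per the 106 + 2 = 108 / 115 - 1 = 114 example; the sign convention stored in the actual table may differ):

```python
def segment_nt_span(ref_start, ref_end, start_offset, end_offset):
    """Trim the codon-aligned span (ref_start, ref_end) down to the
    exact nucleotides the peptide segment matches."""
    return ref_start + start_offset, ref_end - end_offset

# The YWP example: codons span 106-115, matched nucleotides are 108-114.
nt_start, nt_end = segment_nt_span(106, 115, 2, 1)

# The codon-aligned span is always three times the peptide length.
aa_len = (115 - 106) // 3
```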

zhuchcn commented 8 months ago

Running this on cell line data from a collaborator, the peak memory increased from 15 GB to 30 GB. But this doesn't seem to be related to this PR. Will keep troubleshooting. Maybe something changed with the timeout function.

lydiayliu commented 8 months ago

sorry the image isn't working for me


OK, so basically ref_start and ref_end give the full nucleotide sequence of the peptide substring (from the first nucleotide of the first codon to the last nucleotide of the last codon), while the offsets give which specific nucleotides are involved in that "segment" of interest?

zhuchcn commented 8 months ago

That's strange because I can see the image. I was just referring to the image here: https://github.com/uclahs-cds/package-moPepGen/issues/833#issuecomment-1974291789

That's correct. The offsets are used when there is a variant in the first or last codon.

zhuchcn commented 8 months ago

The memory usage is mysterious. Nextflow's trace.txt says the peak RSS is 30.6 GB, compared to only ~15 GB before on the same dataset. However, I can't reproduce it in interactive mode. Here is what I've tried so far. memray is a Python memory profiler; when I ran the same command under memray (so it could track the memory usage), the memory usage never exceeded 13 GB.

(image: memray memory usage plot)

Then I thought maybe memray doesn't properly track child processes with multiprocessing. So I relaunched the nextflow pipeline, and while it was running, I ssh'ed into the worker node and launched a script to track the memory usage:

while true; do docker stats nxf-wNnk473VZTSYAvVOd3sIXrUv --no-stream | tail -n 1 >> memory_log.txt; sleep 5s; done

The memory usage did get higher but never exceeded 19 GB.

(image: sampled docker stats memory log)

But the trace.txt still says the peak RSS is 30.6 GB. So maybe the peak memory happened transiently within the 5-second interval?

task_id  hash       native_id  name                           status     exit  submit                   duration    realtime    %cpu    peak_rss  peak_vmem  rchar    wchar
1        de/c403c5  27885      resolve_filename_conflict (1)  COMPLETED  0     2024-03-05 20:50:09.655  27.4s       16ms        85.0%   0         0          49.2 KB  180 B
4        05/7640d4  27889      resolve_filename_conflict (4)  COMPLETED  0     2024-03-05 20:50:09.903  27.2s       14ms        58.2%   0         0          49.2 KB  180 B
3        69/1b9a80  27897      resolve_filename_conflict (3)  COMPLETED  0     2024-03-05 20:50:13.502  23.6s       26ms        21.8%   0         0          49.2 KB  180 B
2        44/7b368a  27901      resolve_filename_conflict (2)  COMPLETED  0     2024-03-05 20:50:13.675  24.5s       21ms        48.5%   0         0          49.2 KB  180 B
5        50/c26806  29658      call_variant                   COMPLETED  0     2024-03-05 20:50:42.088  4h 28m 44s  4h 28m 41s  214.8%  30.6 GB   47.9 GB    70.7 GB  43 GB
6        c7/7ca1ad  20934      summarize_fasta                COMPLETED  0     2024-03-06 01:19:27.573  2m 50s      2m 38s      102.0%  1.6 GB    2.7 GB     1.8 GB   28.7 KB

I'm going to run the pipeline again but fix the memory limit to 20GB to see whether it fails.
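Relatedly, a polling loop like the docker stats one above can always miss a spike shorter than its sampling interval. On Linux, the kernel itself continuously tracks each process's resident-memory high-water mark in /proc/<pid>/status (the VmHWM field), so a single read after the fact recovers the true peak. A minimal sketch (Linux-only; not part of moPepGen):

```python
def peak_rss_kib(pid="self"):
    """Return a process's peak RSS (high-water mark) in kiB, read from
    /proc/<pid>/status -- no sampling interval, so spikes can't be missed."""
    with open(f"/proc/{pid}/status") as fh:
        for line in fh:
            if line.startswith("VmHWM:"):
                return int(line.split()[1])  # kernel reports kiB
    return None  # field absent (e.g. non-Linux /proc emulation)
```

Note this is per-process: with multiprocessing, each child has its own VmHWM, so the parent's value alone can understate (or a tracer summing them can overstate) the true aggregate peak.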

zhuchcn commented 8 months ago

So I ran the nextflow pipeline again by setting the memory limit to 20 GB. I can verify that the parameter is passed to the command correctly.

In the .command.run file:

nxf_launch() {
docker run -i --cpu-shares 30720 --memory 20480m -e "NXF_DEBUG=${NXF_DEBUG:=0}" -v /hot/user/czhu/project-BackusLabCollab/data:/hot/user/czhu/project-BackusLabCollab/data -w "$PWD" -u $(id -u):$(id -g) $(for i in `id --real --groups`; do echo -n "--group-add=$i "; done) --shm-size=10.24gb --name $NXF_BOXID ghcr.io/uclahs-cds/mopepgen:unstable /bin/bash /hot/user/czhu/project-BackusLabCollab/data/processed/HCT-15/call-NonCanonicalPeptide/reditools/call/work/11/2cff11a7b73fab5a1ee59c97477372/.command.run nxf_trace
}

The pipeline run was successful, meaning the memory usage never actually exceeded 20 GB. But trace.txt still shows a peak_rss of 30.2 GB.

task_id  hash       native_id  name                           status     exit  submit                   duration   realtime   %cpu    peak_rss  peak_vmem  rchar    wchar
4        db/7220c3  6645       resolve_filename_conflict (4)  COMPLETED  0     2024-03-06 10:31:25.582  875ms      22ms       66.0%   0         0          49 KB    173 B
2        7d/417da2  6861       resolve_filename_conflict (2)  COMPLETED  0     2024-03-06 10:31:25.700  785ms      16ms       79.0%   0         0          49 KB    173 B
3        fa/7e1af7  6994       resolve_filename_conflict (3)  COMPLETED  0     2024-03-06 10:31:25.793  697ms      12ms       95.5%   0         0          49 KB    173 B
1        5f/c5bf17  7115       resolve_filename_conflict (1)  COMPLETED  0     2024-03-06 10:31:25.875  620ms      30ms       25.2%   0         0          49 KB    173 B
5        11/2cff11  7380       call_variant                   COMPLETED  0     2024-03-06 10:31:26.721  4h 19m 5s  4h 19m 4s  214.0%  30.2 GB   45 GB      70.6 GB  43 GB
6        cb/908572  27455      summarize_fasta                COMPLETED  0     2024-03-06 14:50:32.312  2m 36s     2m 27s     102.1%  1.6 GB    2.7 GB     1.8 GB   28.7 KB
zhuchcn commented 8 months ago

As discussed at https://github.com/uclahs-cds/group-nextflow-working-group/discussions/54#discussioncomment-8700389, the peak_rss reported by Nextflow is inaccurate for parallelized processes (like this one), so the profiling results are the more accurate numbers.

zhuchcn commented 8 months ago

All problems seem to be solved. I'm merging this for now.