Closed: zhuchcn closed this pull request 8 months ago
Sorry do you have a description of each of the columns? I need to understand them first XD Edit: nvm found it here: https://github.com/uclahs-cds/package-moPepGen/issues/833#issuecomment-1974291789
For the first two columns I suggest going with `sequence` and `header`, or something from the FASTA terminology, for consistency.
This is really cool! From a pure user perspective, I'm just wondering why it would be valuable for me to know that this peptide came from 2 nodes?
```
IQGARATLUF ENST00000614167.2|SNV-1179-G-T|1 0 5 transcript ENST00000614167.2 306 311 2 -2 .
IQGARATLUF ENST00000614167.2|SNV-1179-G-T|1 5 10 transcript ENST00000614167.2 311 316 2 -2 .
```
As is, what information is lost if the above two entries are grouped together?
Also, idk how useful this level of detail is to the average user. Would it be possible to add a verbosity flag to this? `off` / `concise` / `detail`, or something?
In addition to the case above, for `concise` I feel like the following could be summarized into 2 rows: one for the transcript and one for the MNV. I don't like how the MNV breaks up the transcript, and yet the peptide coordinates overlap between the transcript and the variant (obviously they should, but it looks weird).
```
NTLSGMQKFMGEDUNFHERK ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|1 0 2 transcript ENST00000614167.2 201 203 2 -2 .
NTLSGMQKFMGEDUNFHERK ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|1 1 3 . . . . 2 -2 MNV-610-GA-AT
NTLSGMQKFMGEDUNFHERK ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|1 2 8 transcript ENST00000614167.2 203 209 2 -2 .
NTLSGMQKFMGEDUNFHERK ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|1 8 19 transcript ENST00000614167.2 209 220 2 -2 .
```
I changed the column names to `sequence` and `header`. I've made some changes since I wrote the PR. You are right that users don't need to know that a sequence comes from two separate nodes. Having separate segments in the file also increases its size, so I did my best to merge adjacent segments. This is what it looks like right now. You can see the first peptide has three segments: an insertion in the middle and two reference sequences on either side. I won't add additional parameters to control this behavior, because I don't think the extra information you'd get by not merging adjacent segments is useful.
```
#sequence header start end feature_type feature_id ref_start ref_end start_offset end_offset variant
EATEKARHETLCC ENST00000614167.2|INDEL-399-A-AC|1 0 9 transcript ENST00000614167.2 371 398 0 0 .
EATEKARHETLCC ENST00000614167.2|INDEL-399-A-AC|1 9 10 . . . . 0 1 INDEL-399-A-AC
EATEKARHETLCC ENST00000614167.2|INDEL-399-A-AC|1 9 13 transcript ENST00000614167.2 397 409 2 0 .
EATEKARHE ENST00000614167.2|MNV-399-ACC-TAA|1 0 9 transcript ENST00000614167.2 371 398 0 0 .
ARHETLCC ENST00000614167.2|INDEL-399-A-AC|2 0 4 transcript ENST00000614167.2 386 398 0 0 .
ARHETLCC ENST00000614167.2|INDEL-399-A-AC|2 4 5 . . . . 0 1 INDEL-399-A-AC
ARHETLCC ENST00000614167.2|INDEL-399-A-AC|2 4 8 transcript ENST00000614167.2 397 409 2 0 .
NTLSGMQKFMGEDUNFHER ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|4 0 2 transcript ENST00000614167.2 605 611 0 1 .
NTLSGMQKFMGEDUNFHER ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|4 1 3 . . . . 2 2 MNV-610-GA-AT
NTLSGMQKFMGEDUNFHER ENST00000614167.2|SNV-611-G-A|SNV-612-A-T|4 2 19 transcript ENST00000614167.2 611 662 1 0 .
```
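The adjacent-segment merging described above can be sketched roughly like this (an illustrative Python sketch, not the actual moPepGen implementation; the tuple layout mirrors a subset of the table columns):

```python
def merge_adjacent(segments):
    """Fuse contiguous reference segments of one peptide.

    Each segment is a tuple of (start, end, feature_id, ref_start,
    ref_end, variant), where variant == '.' means a plain reference
    segment. Two segments are merged when both are reference segments
    on the same feature and are contiguous in both the peptide and the
    reference coordinates.
    """
    merged = [segments[0]]
    for seg in segments[1:]:
        prev = merged[-1]
        if (prev[5] == '.' == seg[5]          # both reference segments
                and prev[2] == seg[2]         # same feature
                and prev[1] == seg[0]         # contiguous in the peptide
                and prev[4] == seg[3]):       # contiguous on the reference
            merged[-1] = (prev[0], seg[1], prev[2], prev[3], seg[4], '.')
        else:
            merged.append(seg)
    return merged
```

Under this rule, a variant segment in the middle (like the INDEL rows above) blocks the merge, so the peptide keeps three segments; two plain reference segments that abut would collapse into one row.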
I'll add this to the file format page. I think the most confusing part is the `start_offset` and `end_offset`. Basically, `ref_start` is the first nucleotide of the codon that the first amino acid matches, and `ref_end` is the end of the codon that the last amino acid matches. But sometimes an amino acid can harbor a variant. In the example below, `YWP` is matched to the nucleotide sequence `TTGGCC` (108-114), but the corresponding codons span 106-115, so there are start and end offsets: 106 + 2 = 108, and 115 - 1 = 114. I would like the difference between `ref_start` and `ref_end` to be three times the amino acid length.
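The offset arithmetic above can be sketched in a couple of lines (a minimal illustration; the function name is mine, not moPepGen's):

```python
def matched_nt_range(ref_start, ref_end, start_offset, end_offset):
    """Map the codon-boundary coordinates (ref_start/ref_end) back to the
    nucleotides actually matched, by applying start_offset/end_offset."""
    return ref_start + start_offset, ref_end - end_offset

# The YWP example: codons 106-115 with offsets 2 and 1 give nucleotides 108-114.
print(matched_nt_range(106, 115, 2, 1))  # -> (108, 114)
```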
Running this on cell line data from a collaborator, the peak memory increased from 15 GB to 30 GB. But this doesn't seem to be related to this PR. I'll keep troubleshooting. Maybe something changed about the timeout function.
sorry the image isn't working for me
OK, so basically `ref_start`/`ref_end` give the full nucleotide sequence of the peptide substring (from the first nucleotide of the first codon to the last nucleotide of the last codon), but the offsets give which specific nucleotides are involved in that "segment" of interest?
That's strange because I can see the image. I was just referring to the image here: https://github.com/uclahs-cds/package-moPepGen/issues/833#issuecomment-1974291789
That's correct. The offsets are used when there is a variant in the first or last codon.
The memory usage is mysterious. Nextflow's trace.txt says the peak RSS is 30.6 GB, compared to only ~15 GB before on the same dataset. However, I can't reproduce it in interactive mode. Here is what I've tried so far. memray is a Python memory profiler. When I ran the same command with memray (so it could track the memory usage), the memory usage never exceeded 13 GB.
Then I thought maybe memray doesn't properly track child processes spawned with multiprocessing. So I relaunched the nextflow pipeline and, while it was running, ssh'ed into the worker node and launched a script to track the memory usage:
```
while true; do docker stats nxf-wNnk473VZTSYAvVOd3sIXrUv --no-stream | tail -n 1 >> memory_log.txt; sleep 5s; done
```
The memory usage did get higher but never exceeded 19 GB.
But the trace.txt still says the peak RSS is 30.6 GB. So maybe the peak memory happened transiently within the 5-second interval?
```
task_id hash native_id name status exit submit duration realtime %cpu peak_rss peak_vmem rchar wchar
1 de/c403c5 27885 resolve_filename_conflict (1) COMPLETED 0 2024-03-05 20:50:09.655 27.4s 16ms 85.0% 0 0 49.2 KB 180 B
4 05/7640d4 27889 resolve_filename_conflict (4) COMPLETED 0 2024-03-05 20:50:09.903 27.2s 14ms 58.2% 0 0 49.2 KB 180 B
3 69/1b9a80 27897 resolve_filename_conflict (3) COMPLETED 0 2024-03-05 20:50:13.502 23.6s 26ms 21.8% 0 0 49.2 KB 180 B
2 44/7b368a 27901 resolve_filename_conflict (2) COMPLETED 0 2024-03-05 20:50:13.675 24.5s 21ms 48.5% 0 0 49.2 KB 180 B
5 50/c26806 29658 call_variant COMPLETED 0 2024-03-05 20:50:42.088 4h 28m 44s 4h 28m 41s 214.8% 30.6 GB 47.9 GB 70.7 GB 43 GB
6 c7/7ca1ad 20934 summarize_fasta COMPLETED 0 2024-03-06 01:19:27.573 2m 50s 2m 38s 102.0% 1.6 GB 2.7 GB 1.8 GB 28.7 KB
```
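If the peak really is transient within the 5-second sampling window, a finer-grained sampler might catch it. A hypothetical, Linux-only sketch that reads `VmRSS` from `/proc` (the helper names and the interval are my own, and this would need to run inside the container or against the container's PID):

```python
import time

def rss_kb(pid):
    """Read a process's resident set size (kB) from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        for line in fh:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

def sample_peak(pid, duration_s=60, interval_s=0.5):
    """Poll RSS at a sub-second interval and return the maximum observed."""
    peak = 0
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        peak = max(peak, rss_kb(pid))
        time.sleep(interval_s)
    return peak
```

Even at 0.5 s, a spike from a short-lived child process could still slip between samples, so this would only narrow the hypothesis down, not rule it out.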
I'm going to run the pipeline again but fix the memory limit to 20GB to see whether it fails.
So I ran the nextflow pipeline again by setting the memory limit to 20 GB. I can verify that the parameter is passed to the command correctly.
In the `.command.run` log file:
```
nxf_launch() {
    docker run -i --cpu-shares 30720 --memory 20480m -e "NXF_DEBUG=${NXF_DEBUG:=0}" -v /hot/user/czhu/project-BackusLabCollab/data:/hot/user/czhu/project-BackusLabCollab/data -w "$PWD" -u $(id -u):$(id -g) $(for i in `id --real --groups`; do echo -n "--group-add=$i "; done) --shm-size=10.24gb --name $NXF_BOXID ghcr.io/uclahs-cds/mopepgen:unstable /bin/bash /hot/user/czhu/project-BackusLabCollab/data/processed/HCT-15/call-NonCanonicalPeptide/reditools/call/work/11/2cff11a7b73fab5a1ee59c97477372/.command.run nxf_trace
}
```
The pipeline run was successful, meaning memory usage was never over 20 GB. But the trace.txt still shows the peak_rss is 30.2 GB.
```
task_id hash native_id name status exit submit duration realtime %cpu peak_rss peak_vmem rchar wchar
4 db/7220c3 6645 resolve_filename_conflict (4) COMPLETED 0 2024-03-06 10:31:25.582 875ms 22ms 66.0% 0 0 49 KB 173 B
2 7d/417da2 6861 resolve_filename_conflict (2) COMPLETED 0 2024-03-06 10:31:25.700 785ms 16ms 79.0% 0 0 49 KB 173 B
3 fa/7e1af7 6994 resolve_filename_conflict (3) COMPLETED 0 2024-03-06 10:31:25.793 697ms 12ms 95.5% 0 0 49 KB 173 B
1 5f/c5bf17 7115 resolve_filename_conflict (1) COMPLETED 0 2024-03-06 10:31:25.875 620ms 30ms 25.2% 0 0 49 KB 173 B
5 11/2cff11 7380 call_variant COMPLETED 0 2024-03-06 10:31:26.721 4h 19m 5s 4h 19m 4s 214.0% 30.2 GB 45 GB 70.6 GB 43 GB
6 cb/908572 27455 summarize_fasta COMPLETED 0 2024-03-06 14:50:32.312 2m 36s 2m 27s 102.1% 1.6 GB 2.7 GB 1.8 GB 28.7 KB
```
As discussed at https://github.com/uclahs-cds/group-nextflow-working-group/discussions/54#discussioncomment-8700389, the `peak_rss` reported by Nextflow is inaccurate for parallelized processes (like this one). So the profiling results are the more accurate ones.
All problems seem to be solved. I'm merging this for now.
Description
As described in #833, an additional table is now saved by `callVariant` with detailed annotation of each segment of each peptide, including where it maps to the original transcript and whether it carries a variant. The table looks like this below.

This brings another benefit. Because the peptide information is bigger than just the sequence and FASTA header, I have to save it to disk instead of holding it in memory until the end. And since this table contains the FASTA header too, the FASTA headers don't need to stay in memory either. So with this, only a `dict` of sequences and pointers into the table is kept in memory, and the information can be loaded quickly at the end and converted into FASTA entries to save. So basically this should reduce memory usage, too.

Closes #833
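The dict-of-pointers design described above could look roughly like this (an illustrative sketch with made-up names, not the actual moPepGen code): each record is streamed to disk as it is produced, and only its byte offset is kept in memory, so the full entries can be re-read cheaply at the end.

```python
import os

class PeptideTable:
    """Sketch: stream records to disk, keep only sequence -> byte offsets."""

    def __init__(self, path):
        self._fh = open(path, "w+")
        self.offsets = {}  # sequence -> list of byte offsets into the file

    def write(self, sequence, record_line):
        """Append one record and remember where it starts in the file."""
        self._fh.seek(0, os.SEEK_END)
        self.offsets.setdefault(sequence, []).append(self._fh.tell())
        self._fh.write(record_line + "\n")

    def load(self, sequence):
        """Seek back to each stored offset and re-read the full records."""
        records = []
        for offset in self.offsets.get(sequence, []):
            self._fh.seek(offset)
            records.append(self._fh.readline().rstrip("\n"))
        return records

    def close(self):
        self._fh.close()
```

The memory footprint then scales with the number of distinct sequences plus one integer per record, rather than with the full annotated table.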
Checklist
- This PR does not commit image files (.png, .jpeg), .pdf, .RData, .xlsx, .doc, .ppt, or other non-plain-text files. To automatically exclude such files using a .gitignore file, see here for example.
- I have updated the CHANGELOG.md under the next release version or unreleased, and updated the date.