immunomind / immunarch

🧬 Immunarch: an R Package for Fast and Painless Exploration of Single-cell and Bulk T-cell/Antibody Immune Repertoires
https://immunarch.com
Apache License 2.0

Remove non-productive sequences #286

Open anabbi opened 2 years ago

anabbi commented 2 years ago

Hello and thank you for your great work!

I loaded my MiXCR files, keeping the non-productive sequences, using the command below: int_load <- repLoad("~/cloneslinks", .mode = "single", .coding = FALSE)

Now I want to generate a subset of $data containing productive-only sequences. The command below does not seem to subset $data or $meta: int_load_productive <- repFilter(int_load, .method = "by.clonotype", .query = list(CDR3.aa = exclude("partial", "out_of_frame")))

I am using immunarch_0.6.7. Can you please advise on how to do this? I tried to find out how you define partial or out_of_frame, but I did not find them in your docs.

Thanks, Arash

MVolobueva commented 2 years ago

Hi @anabbi!

Thank you for using our packages and contacting us!

We call a sequence partial if we did not find a stop codon at all. We call a sequence out of frame if we see a stop codon in the CDR3 region.

"This command below does not seem to subset the $data or $meta:"

I suppose you are right. We will fix it in a future version.

"Now I want to generate a subset of $data containing productive-only sequences."

For your purpose, use this command:

noncoding(int_load)

Thanks again for contacting us. We will be back with updates on your issue.

Enjoy your week, Maria Samokhina

Alexander230 commented 2 years ago

Hi, @anabbi!

I can add that the command repFilter(int_load, .method = "by.clonotype", .query = list(CDR3.aa = exclude("partial", "out_of_frame"))) filters out only the rows where the CDR3.aa column is exactly partial or out_of_frame; it does not filter out all non-productive sequences. The repFilter() command is not intended to analyze the data; it filters by values that have already been calculated. To filter out non-productive sequences, use the noncoding command, as Maria recommended.
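To make the difference concrete, here is a minimal base-R sketch on a made-up clonotype table (the sequences and counts are invented for illustration). It assumes that non-productive CDR3.aa values carry a "*" for a stop codon or a "~" for a frameshift; that marker convention is an assumption about the immunarch data format, not something stated in this thread. The exact-value filter removes nothing because no row is literally "partial" or "out_of_frame", while the pattern-based filter separates the two groups.

```r
# Toy clonotype table; all values are hypothetical.
clones <- data.frame(
  CDR3.aa = c("CASSLGQGYEQYF", "CASS*GQGYEQYF", "CASSL~GQYEQYF", "CASSPDRGNTEAFF"),
  Clones  = c(10L, 4L, 2L, 7L),
  stringsAsFactors = FALSE
)

# Exact-value exclusion, analogous to exclude("partial", "out_of_frame"):
# only rows whose CDR3.aa equals one of these strings would be dropped.
exact_kept <- clones[!clones$CDR3.aa %in% c("partial", "out_of_frame"), ]

# Pattern-based split (assumption: "*" = stop codon, "~" = frameshift).
is_noncoding  <- grepl("[*~]", clones$CDR3.aa)
productive    <- clones[!is_noncoding, ]
nonproductive <- clones[is_noncoding, ]

print(nrow(exact_kept) == nrow(clones))  # the exact-value filter removed no rows
print(productive$CDR3.aa)

# With immunarch loaded, the analogous calls would presumably be
# (check ?coding in your installed version before relying on this):
#   productive_data    <- coding(int_load$data)
#   nonproductive_data <- noncoding(int_load$data)
```

Note that, according to the immunarch documentation, coding() keeps the coding (productive) sequences and noncoding() keeps the non-coding ones, so a productive-only subset most likely calls for coding() rather than noncoding().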

Best regards, Aleksandr