immunomind / immunarch

🧬 Immunarch: an R Package for Fast and Painless Exploration of Single-cell and Bulk T-cell/Antibody Immune Repertoires
https://immunarch.com
Apache License 2.0

Error in strsplit(df[[.dalignments]], "|", TRUE, FALSE, TRUE) #79

Open chenyx47 opened 4 years ago

chenyx47 commented 4 years ago

🐛 Bug

I get this error when I use repLoad(/path/to/mixcrclonesoutputfile.txt).

Error in strsplit(df[[.dalignments]], "|", TRUE, FALSE, TRUE) : non-character argument

I rechecked the data and found that the error arises when the input has only one row, like this (omitting other columns):

cloneId cloneCount cloneFraction targetSequences targetQualities allVHitsWithScore allDHitsWithScore
0 17.0 1.0 TGTGCCAGTAGTATAGACGGTTCATCTGGAAACACCATATATTTT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

To Reproduce

Steps to reproduce the behavior:

1. repLoad(/path/to/mixcrclonesoutputfile.txt)
2. Error in strsplit(df[[.dalignments]], "|", TRUE, FALSE, TRUE) : non-character argument

Expected behavior

Additional context

vadimnazarov commented 4 years ago

Hi @chenyx47

Thank you! In the package we addressed the common case of repertoire files where all columns are available. Would you be willing to tell me more about the reasons for stripping some columns from the output? It will greatly help us improve the package and/or provide recommendations on how to deal with this type of bug.

chenyx47 commented 4 years ago

Thanks for your reply, but you may have misunderstood my question. When I use repLoad(/path/to/mixcrclonesoutputfile.txt), I get this error:

Error in strsplit(df[[.dalignments]], "|", TRUE, FALSE, TRUE) : non-character argument

So I rechecked my input data and found that the error arises only when the input (MiXCR output format) has exactly one row. I pasted an example to show the input format that triggers the error, but for reasons of space I pasted only some of the columns. In fact, all of the columns are present in the file.

cloneId cloneCount cloneFraction targetSequences targetQualities allVHitsWithScore allDHitsWithScore
0 17.0 1.0 TGTGCCAGTAGTATAGACGGTTCATCTGGAAACACCATATATTTT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

In contrast, input with more than one row, or with zero rows, does not trigger the error.

More than one row:

cloneId cloneCount cloneFraction targetSequences targetQualities allVHitsWithScore allDHitsWithScore
0 17.0 1.0 TGTGCCAGTAGTATAGACGGTTCATCTGGAAACACCATATATTTT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
0 17.0 1.0 TGTGCCAGTAGTATAGACGGTTCATCTGGAAACACCATATATTTT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Zero rows:

cloneId cloneCount cloneFraction targetSequences targetQualities allVHitsWithScore allDHitsWithScore

vadimnazarov commented 4 years ago

Hi @chenyx47

Thank you! It's clear to me now, we will see what we can do. Is this a blocker for your research? It seems that having a file with only one row is unusual.

chenyx47 commented 4 years ago

Hi @vadimnazarov Thanks for your reply. Getting a file with only one row seems weird to me too. It may be because my analysis was based on bulk RNA-seq data. For now it is a blocker for my research, so I would appreciate it if you could offer a solution.

sciencepeak commented 4 years ago

Hi, @vadimnazarov

I am also working on bulk RNA-seq data to find T cells in tumor samples. To me it is usual to have a MiXCR result file with only two lines. After loading a sample with repLoad(), I got the same error message as the researcher above.

> input_path <- "C:\\Users\\Andy\\Desktop\\56.filtered.txt"
> immune_data_list <- repLoad(input_path)

== Step 1/3: loading repertoire files... ==

Processing "<initial>" ...
  -- Parsing "C:\Users\Andy\Desktop\56.filtered.txt" -- mixcr
Error in strsplit(df[[.dalignments]], "|", TRUE, FALSE, TRUE) : 
  non-character argument

Below is my two-line productive cdr3aa result.

cloneId cloneCount  cloneFraction   targetSequences targetQualities allVHitsWithScore   allDHitsWithScore   allJHitsWithScore   allCHitsWithScore   allVAlignments  allDAlignments  allJAlignments  allCAlignments  nSeqFR1 minQualFR1  nSeqCDR1    minQualCDR1 nSeqFR2 minQualFR2  nSeqCDR2    minQualCDR2 nSeqFR3 minQualFR3  nSeqCDR3    minQualCDR3 nSeqFR4 minQualFR4  aaSeqFR1    aaSeqCDR1   aaSeqFR2    aaSeqCDR2   aaSeqFR3    aaSeqCDR3   aaSeqFR4    refPoints
1   2   0.666666666666667   TGTGCTAGTGGTTGGGGGACCTACAATGAGCAGTTCTTC JJJJJJJJJJJJJJJJJJAJJJJJJJJJJJJJJJJJJJJ TRBV12-5*00(296)        TRBJ2-1*00(240) TRBC2*00(925)   273|289|310|0|16|ST286G|66.0        22|42|70|19|39||100.0                                               TGTGCTAGTGGTTGGGGGACCTACAATGAGCAGTTCTTC 32                              CASGWGTYNEQFF       :::::::::0:-1:16:::::19:-2:39:::
2   1   0.333333333333333   TGTGCCAGCAGTAGGACCCCGACCTACGAGCAGTACTTC JJJJJFJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ TRBV6-5*00(785)     TRBJ2-7*00(225) TRBC2*00(295),TRBC1*00(267) 270|282|307|0|12||60.0      22|39|67|22|39||85.0    ;                                           TGTGCCAGCAGTAGGACCCCGACCTACGAGCAGTACTTC 37                              CASSRTPTYEQYF       :::::::::0:-5:12:::::22:-2:39:::

mjrmason commented 3 years ago

FYI I also have the same issue: I am running MiXCR on bulk, paired-end RNA-seq from tumor samples and have a handful of repertoires with only one clone. This is with extremely high-depth reads, btw. immunarch repLoad fails here for me too.

samkleeman1 commented 3 years ago

Did anyone find a solution for this?

adrlar commented 3 years ago

Hi, I am having the exact same issue as others in this thread. I have been digging in the code a little to see if I could find a root of the problem. The repLoad function decides which kind of file is being imported and uses different parsers for different data, so the error here is coming from the mixcr_parse() function.

The error happens when the parser determines that the data is a VDJ recombination type (for example TRBV clones) but MiXCR has no info in the D alignment column for any of the clones.

    # check for VJ or VDJ recombination
    # VJ / VDJ / Undeterm
    recomb_type <- "Undeterm"
    if (sum(substr(head(df)[[.vgenes]], 1, 4) %in% c("TCRA", "TRAV", "TRGV", "IGKV", "IGLV"))) {
        recomb_type <- "VJ"
    } else if (sum(substr(head(df)[[.vgenes]], 1, 4) %in% c("TCRB", "TRBV", "TRDV", "IGHV"))) {
        recomb_type <- "VDJ"
    }

This is followed by a check on recomb_type, and the "VDJ" branch contains the offending strsplit calls:

    if (recomb_type == "VJ") {
      df$VD.insertions <- -1
    } else if (recomb_type == "VDJ") {
      logic <- sapply(strsplit(df[[.dalignments]], "|", TRUE, FALSE, TRUE), length) >= 4 &
        sapply(strsplit(df[[.vend]], "|", TRUE, FALSE, TRUE), length) >= 5
      df$VD.insertions[logic] <-
        as.numeric(sapply(strsplit(df[[.dalignments]][logic], "|", TRUE, FALSE, TRUE), "[[", 4)) -
        as.numeric(sapply(strsplit(df[[.vend]][logic], "|", TRUE, FALSE, TRUE), "[[", 5)) - 1
    }

Note that a strsplit like this splits each element of the vector it has been given. It accepts NA elements mixed in with strings, but it fails when every element is NA, because c(NA, NA) is then a logical vector rather than a character vector (see the str() output at the end). Here is some R console testing I just did to confirm this, and to show that forcing a cast to a character vector avoids the error:

> strsplit(c("a|b|c|d|", NA), "|", TRUE, FALSE, TRUE)
[[1]]
[1] "a" "b" "c" "d"

[[2]]
[1] NA

> strsplit(c(NA, NA), "|", TRUE, FALSE, TRUE)
Error in strsplit(c(NA, NA), "|", TRUE, FALSE, TRUE) : 
  non-character argument
> strsplit(as.character(c(NA, NA)), "|", TRUE, FALSE, TRUE)
[[1]]
[1] NA

[[2]]
[1] NA

> str(c(NA, NA))
 logi [1:2] NA NA
> str(c("TEST", NA))
 chr [1:2] "TEST" NA
> str(as.character(c(NA, NA)))
 chr [1:2] NA NA

Finally, note that there are other splits like these further down in the parsing code, which produce exactly the same error. Since I don't know the code well, I can't say for sure that this is a good idea, but the solution seems to be to make sure R treats this column as a character vector. Perhaps someone could have a look at this again?

Thanks!
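
The coercion idea from the console session above can be wrapped in a small helper. This is only a sketch in base R: `split_alignments` is a hypothetical name and is not part of immunarch, and the sample alignment string is the V alignment posted earlier in this thread.

```r
# Hypothetical helper (not part of immunarch): split pipe-delimited alignment
# strings, coercing to character first so that an all-NA column, which R
# stores as a logical vector, is accepted by strsplit().
split_alignments <- function(x) {
  strsplit(as.character(x), "|", fixed = TRUE)
}

all_na <- c(NA, NA)                                  # logical vector: the failing case
mixed  <- c("273|289|310|0|16|ST286G|66.0", NA)      # character vector: already worked

# Each NA stays a single NA element, so a later length check can skip it.
lengths(split_alignments(all_na))   # 1 1
lengths(split_alignments(mixed))    # 7 1
```

Because every NA comes back as a one-element list entry, the existing `length >= 4` style checks in the parser would simply treat those rows as "no D alignment" instead of raising an error.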

owenwilkins commented 3 years ago

having the same issue and would also appreciate it being addressed

AimSchina commented 3 years ago

Having this issue as well, it would be nice if a solution comes up!:) thank you!

marisolbc commented 2 years ago

I'm having this issue too. It would be nice to have a solution. Thank you in advance :)

samkleeman1 commented 2 years ago

Same issue ongoing for me... Files with one line are rejected

ghost commented 2 years ago

Same issue, files with one line (one clonotype) are rejected.
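
For anyone triaging a batch of exports, one way to spot the files that match the diagnosis given earlier in this thread (no D alignment for any clone) is to check the column directly before loading. This is a sketch under assumptions: `has_no_d_alignments` is a hypothetical helper, the column name `allDAlignments` is taken from the header posted above, and the file is assumed to be a tab-separated MiXCR export.

```r
# Hypothetical helper (not part of immunarch): TRUE when a tab-separated
# MiXCR export has at least one clone and none of them has a D alignment.
has_no_d_alignments <- function(path, column = "allDAlignments") {
  # Read everything as character so empty fields stay "" instead of being
  # guessed as logical NA.
  df <- read.delim(path, colClasses = "character", check.names = FALSE)
  if (!column %in% names(df)) stop("column not found: ", column)
  nrow(df) > 0 && all(is.na(df[[column]]) | df[[column]] == "")
}

# Minimal one-clone example with an empty D alignment field.
tmp <- tempfile(fileext = ".txt")
writeLines(c("cloneId\tcloneCount\tallVAlignments\tallDAlignments",
             "1\t2\t273|289|310|0|16|ST286G|66.0\t"), tmp)
has_no_d_alignments(tmp)
```

Files flagged this way are the ones expected to hit the strsplit error in an unpatched parser; whether to skip them or load them with a patched build is a separate decision.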

plezar commented 2 years ago

Hi! I was having the same problem and had a quick look at the code. The following block of code essentially determines the size of DJ insertion from D and J alignment encoding, and the one directly preceding it does the same for VD insertion.

    .dj.insertions <- "DJ.insertions"
    df$DJ.insertions <- -1
    if (recomb_type == "VJ") {
      df$DJ.insertions <- -1
    } else if (recomb_type == "VDJ") {
      logic <- sapply(strsplit(df[[.jstart]], "|", TRUE, FALSE, TRUE), length) >= 4 &
        sapply(strsplit(df[[.dalignments]], "|", TRUE, FALSE, TRUE), length) >= 5
      df$DJ.insertions[logic] <-
        as.numeric(sapply(strsplit(df[[.jstart]][logic], "|", TRUE, FALSE, TRUE), "[[", 4)) -
        as.numeric(sapply(strsplit(df[[.dalignments]][logic], "|", TRUE, FALSE, TRUE), "[[", 5)) - 1
    }

I suppose -1 means that the insertion is either undefined because of the VJ recombination type, or cannot be determined with a high degree of confidence (which is the case when the D alignment encoding is missing). So I slightly modified the conditions in both if statements (VD and DJ) so that the offending strsplit calls are not executed if all elements in df[[.dalignments]] are NA.

    if (recomb_type == "VJ" | all(is.na(df[[.dalignments]]))) {
      df$VD.insertions <- -1
    } else if (recomb_type == "VDJ") {
      logic <- sapply(strsplit(df[[.dalignments]], "|", TRUE, FALSE, TRUE), length) >= 4 &
        sapply(strsplit(df[[.vend]], "|", TRUE, FALSE, TRUE), length) >= 5
      df$VD.insertions[logic] <-
        as.numeric(sapply(strsplit(df[[.dalignments]][logic], "|", TRUE, FALSE, TRUE), "[[", 4)) -
        as.numeric(sapply(strsplit(df[[.vend]][logic], "|", TRUE, FALSE, TRUE), "[[", 5)) - 1
    }

    .dj.insertions <- "DJ.insertions"
    df$DJ.insertions <- -1
    if (recomb_type == "VJ" | all(is.na(df[[.dalignments]]))) {
      df$DJ.insertions <- -1
    } else if (recomb_type == "VDJ") {
      logic <- sapply(strsplit(df[[.jstart]], "|", TRUE, FALSE, TRUE), length) >= 4 &
        sapply(strsplit(df[[.dalignments]], "|", TRUE, FALSE, TRUE), length) >= 5
      df$DJ.insertions[logic] <-
        as.numeric(sapply(strsplit(df[[.jstart]][logic], "|", TRUE, FALSE, TRUE), "[[", 4)) -
        as.numeric(sapply(strsplit(df[[.dalignments]][logic], "|", TRUE, FALSE, TRUE), "[[", 5)) - 1
    }

I don't know if and how exactly this affects things downstream, but this worked for me and I guess would suffice for the time being. I forked the repo and modified it so if anyone is interested please check the repo here.
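
The guard described above can be tried in isolation with a standalone function. This is a minimal sketch, not the fork's code: `vd_insertions` is a hypothetical name, it takes plain vectors instead of the parser's data frame, and the D alignment string used in the second call is invented purely for illustration (the V alignment strings are the ones posted earlier in this thread).

```r
# Hypothetical standalone version of the guarded VD-insertion logic.
# Returns -1 where the insertion cannot be determined.
vd_insertions <- function(v_align, d_align) {
  out <- rep(-1, length(v_align))
  # Guard: with no D alignment at all there is nothing to compute.
  if (all(is.na(d_align))) return(out)
  v <- strsplit(as.character(v_align), "|", fixed = TRUE)
  d <- strsplit(as.character(d_align), "|", fixed = TRUE)
  ok <- lengths(d) >= 4 & lengths(v) >= 5
  # Same arithmetic as the parser: D field 4 minus V field 5 minus 1.
  out[ok] <- as.numeric(vapply(d[ok], `[[`, character(1), 4)) -
    as.numeric(vapply(v[ok], `[[`, character(1), 5)) - 1
  out
}

v <- c("273|289|310|0|16|ST286G|66.0", "270|282|307|0|12||60.0")

vd_insertions(v, c(NA, NA))                       # -1 -1, no error
vd_insertions(v[1], "10|20|48|18|28||50.0")       # 18 - 16 - 1 = 1 (invented D alignment)
```

Since both operands of the condition are length-one values, `||` would be the more idiomatic operator in an `if` than the `|` used in the fork, although both behave the same here.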

samkleeman1 commented 2 years ago

It would appear that you have fixed it! I downloaded your forked repo and it works great. Thanks so much and have a great day!

Alexander230 commented 2 years ago

Hi, @plezar! My name is Aleksandr Popov, I am a developer of the Immunarch package.

Thank you very much for this bugfix! I will merge it into the dev branch upstream, so it will be included in the next release of Immunarch.

Good luck, Aleksandr

jdm204 commented 2 years ago

@plezar's fork works for me (thanks!) but the current dev branch in this repo fails with:

Error in tbl_subset2(x, j = i, j_arg = substitute(i)) : 
  object '.dalignments' not found

Alexander230 commented 2 years ago

Hi, @jdm204! Thank you for using our software!

I've added a fix for this error to the dev branch; it should now work as expected.

Best regards, Aleksandr