rnabioco / djvdj

An R package to analyze single-cell V(D)J data
https://rnabioco.github.io/djvdj
Other
24 stars 4 forks source link

Fixed erroneous column guessing in import_vdj #116

Closed jowkar closed 1 year ago

jowkar commented 1 year ago

The function import_vdj() can sometimes fail due to erroneous guessing (by readr::cols()) of columns in the input data files "filtered_contig_annotations.csv" and "airr_rearrangement.tsv" related to the d gene. Most likely this occur due to these genes being called more rarely by CellRanger, and if too little information related to these are included in the input files, the function seems to believe that these columns should have logical rather than character values. This results in the following warning, and downstream errors in the function:

Warning message:                                                                                                                     
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)

Checking the guessed column types with "readr::spec(res)" on the loaded data from "filtered_contig_annotations.csv" gives the following result:

readr::spec(dat)
cols(
  barcode = col_character(),
  is_cell = col_logical(),
  contig_id = col_character(),
  high_confidence = col_logical(),
  length = col_double(),
  chain = col_character(),
  v_gene = col_character(),
  d_gene = col_logical(),
  j_gene = col_character(),
  c_gene = col_character(),
  full_length = col_logical(),
  productive = col_logical(),
  fwr1 = col_character(),
  fwr1_nt = col_character(),
  cdr1 = col_character(),
  cdr1_nt = col_character(),
  fwr2 = col_character(),
  fwr2_nt = col_character(),
  cdr2 = col_character(),
  cdr2_nt = col_character(),
  fwr3 = col_character(),
  fwr3_nt = col_character(),
  cdr3 = col_character(),
  cdr3_nt = col_character(),
  fwr4 = col_character(),
  fwr4_nt = col_character(),
  reads = col_double(),
  umis = col_double(),
  raw_clonotype_id = col_character(),
  raw_consensus_id = col_character(),
  exact_subclonotype_id = col_double()

Note " d_gene = col_logical()" here.

Similarly, for the airr data object:

cols(
  cell_id = col_character(),
  clone_id = col_character(),
  sequence_id = col_character(),
  sequence = col_character(),
  sequence_aa = col_character(),
  productive = col_logical(),
  rev_comp = col_logical(),
  v_call = col_character(),
  v_cigar = col_character(),
  d_call = col_logical(),
  d_cigar = col_logical(),
  j_call = col_character(),
  j_cigar = col_character(),
  c_call = col_character(),
  c_cigar = col_character(),
  sequence_alignment = col_character(),
  germline_alignment = col_character(),
  junction = col_character(),
  junction_aa = col_character(),
  junction_length = col_double(),
  junction_aa_length = col_double(),
  v_sequence_start = col_double(),
  v_sequence_end = col_double(),
  d_sequence_start = col_logical(),
  d_sequence_end = col_logical(),
  j_sequence_start = col_double(),
  j_sequence_end = col_double(),
  c_sequence_start = col_double(),
  c_sequence_end = col_double(),
  consensus_count = col_double(),
  duplicate_count = col_double(),
  is_cell = col_logical()
)

Here, "d_call = col_logical()", "d_cigar = col_logical()", "d_sequence_start = col_logical()" and "d_sequence_end = col_logical()" differ from how columns for the v, j and c genes are treated.

The fix suggested here uses hard coding of column data types to avoid this issue. If there is any reason to need column data type guessing for future flexibility, perhaps some other solution could be implemented (such as a function input argument for column specification with default argument set to either readr::cols() or some hard coded choice), but this should solve it for now, and seems to do so in brief tests made using my own data.

Some other minor changes related to deprecation of ".data$"-style arguments to dplyr and tidyr functions that were throwing warnings related to that were also made.

jayhesselberth commented 1 year ago

Alternatively we could use the compact format (here for filtered_contig_annotations.csv):

  res <- purrr::map(
    res,
    readr::read_csv,
    col_types = 'cccccllccccccccccccccllddddddl',
    progress  = FALSE
  )

How often do these formats change? Maybe note the current version.

jowkar commented 1 year ago

That might indeed be a more elegant solution, although maybe naming the columns explicitly might help in case they switch up the format again by introducing another column, so that they end up in a different order. But perhaps read_csv would complain anyway then. (Note though that different column specifications are needed for filtered_contig_annotations.csv and the airr file.)

The version of CellRanger I used here was 7.0.1, and djvdj v. 0.0.0.9000 (although this particular function looks more or less the same compared to the latest development version). I ran several samples with CellRanger 7.0.1 and with the same settings. Some of these samples were handled fine by import_vdj, but at least one sample triggered this problem. So I don't even know if 10x have changed the format recently, it could just be that there are cases like this where readr::cols() fails to predict the right datatype for the D-gene columns, since most rows tend to be empty/NA/None there. readr::cols() might simply sample the first N rows to make its prediction, and if those are empty/NA, it will just guess that it is of datatype logical, or something.

sheridar commented 1 year ago

Thanks @jowkar, really appreciate the feedback. The AIRR format shouldn't really change since this is fairly standardized. However, the format of the csv file has changed some between different versions of cellranger. Ideally we would only hard code the columns that have issues. I want to look into this a little more before merging

jowkar commented 1 year ago

Sure, I have a feeling that there should be some smarter way to solve this that is a compromise between hard coding everything and pure prediction. Like extracting the column predictions first and running some sanity check on important or known problematic columns perhaps, and then feeding that to read_csv.

sheridar commented 1 year ago

Thanks @jowkar! If you encounter any other issues or have any feature requests please feel free to file an issue