morrislab / qapa

RNA-seq Quantification of Alternative Polyadenylation
GNU General Public License v3.0

Incompatibility with latest data.table version (1.12.2) #13

Open bepoli opened 5 years ago

bepoli commented 5 years ago

Hello, I just want to report an incompatibility with the latest version of data.table (recently published on CRAN). With data.table=1.12.0, I usually get stderr output like this when computing the PAU values from Salmon counts:

[qapa] Version 1.2.1
Merging samples by TPM
  |======================================================================| 100%
Separating Ensembl IDs
Adding Ensembl metadata
Found 76575 / 76575 (100%) matches
Warning messages:
1: In `[.data.table`(df, , `:=`(c("Transcript", "Gene", "Species",  :
  Supplied 9 columns to be assigned a list (length 11) of values (2 unused)
2: In eval(jsub, SDenv, parent.frame()) : NAs introduced by coercion
3: In eval(jsub, SDenv, parent.frame()) : NAs introduced by coercion
4: In eval(jsub, SDenv, parent.frame()) : NAs introduced by coercion

Finished merging data
Melting data frame
Operating on forward strand
Calculating Poly(A) Usage
         1714416 rows, 8046 genes

Operating on reverse strand
Calculating Poly(A) Usage
         1654840 rows, 7904 genes

Adding input expression values

Finished computing PAU!
[qapa] Finished!

and I get an NA value wherever the counts of all isoforms are zero in a given sample.

However, since data.table version 1.12.2, this behaviour changed:

Merging samples by TPM
  |======================================================================| 100%
Separating Ensembl IDs
Error in `[.data.table`(df, , `:=`(c("Transcript", "Gene", "Species",  :
  Supplied 9 columns to be assigned 11 items. Please see NEWS for v1.12.2.
Calls: separate_ensembl_field -> [ -> [.data.table
Execution halted
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  no lines available in input
Calls: data.table -> read.csv -> read.table
Execution halted
[qapa] Version 1.2.1
[qapa] Finished!

Execution is halted and the resulting output file is empty. It's also worth mentioning that qapa quant still returns a zero exit status (so it won't halt a pipeline running in the background).
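Since the exit status cannot be relied on here, a pipeline could check the output file itself. The following is a minimal sketch of such a guard; the helper name `pau_output_is_usable` is hypothetical and not part of QAPA, and it assumes the results are written to a single file.

```python
import os


def pau_output_is_usable(path):
    """Return True only if `path` is an existing, non-empty file."""
    # A zero exit status is not enough in this case, so inspect the
    # output file directly: a failed run leaves it missing or empty.
    return os.path.isfile(path) and os.path.getsize(path) > 0
```

A pipeline step could call this on the output path after qapa quant returns and abort when it yields False.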

Have a good day

kcha commented 5 years ago

Hi @benplm

Thank you for your interest in QAPA and for reporting this incompatibility with data.table 1.12.2. It's not clear to me what change in data.table is causing this issue. If possible, can you e-mail me (k.ha -at- mail.utoronto.ca) some sample count files that I can use to try to replicate the problem?

kcha commented 5 years ago

Closing this issue as I have been unable to reproduce any error related to data.table 1.12.2. If it is still a problem, feel free to reopen.

songrunxian commented 2 years ago

Hi, have you solved this problem?

rasimbarutcu commented 2 years ago

Hi Kevin,

I get the exact same error and it does seem to be related to the R-data.table update starting from v1.12.2. Please see the new features for v1.12.2 (https://cran.r-project.org/web/packages/data.table/news/news.html).

I am pasting the error below. Thanks!

Merging samples by TPM
  |======================================================================| 100%
Separating Ensembl IDs
Error in `[.data.table`(df, , `:=`(c("Transcript", "Gene", "Species",  :
  Supplied 9 columns to be assigned 11 items. Please see NEWS for v1.12.2.
Calls: separate_ensembl_field -> [ -> [.data.table
Execution halted
qapa.qapa - 2022-07-21 10:55:27,728 - INFO - compute_pau.R -e intermediate.txt
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  no lines available in input
Calls: data.table -> read.csv -> read.table
Execution halted
qapa.qapa - 2022-07-21 10:55:28,040 - INFO - Finished!

NJU-Bio-Info commented 1 year ago

I have got the same error message:

Merging samples by TPM
  |======================================================================| 100%
Separating Ensembl IDs
Error in `[.data.table`(df, , `:=`(c("Transcript", "Gene", "Species",  : 
  Supplied 9 columns to be assigned 11 items. Please see NEWS for v1.12.2.
Calls: separate_ensembl_field -> [ -> [.data.table
Execution halted
qapa.qapa     - 2022-12-22 14:11:58,524 - INFO     - compute_pau.R -e /tmp/qapa_merge_z1752tja
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  no lines available in input
Calls: data.table -> read.csv -> read.table
Execution halted
qapa.qapa     - 2022-12-22 14:11:59,037 - INFO     - Finished!

and the final file is empty.

kcha commented 1 year ago

Hi, to date I still don't have a good grasp on this issue and don't have much bandwidth to investigate. It seems to affect a very small number of users. If you are able to, could you e-mail me (k ha mail utoronto ca) some of your data files so that I can take a look when I have time? Please include:

  1. Ensembl DB file
  2. One of your quant.sf files. If you like you can remove the actual TPM values, I am only interested in the sequence ID column.

Also what version of data.table do you have installed?

NJU-Bio-Info commented 1 year ago

Hi, I have sent you an email about the details. @kcha

kcha commented 1 year ago

@NJU-Bio-Info, thanks for sending your files. I think I found the cause. It looks like, for human, a handful of genes on chrY have underscores in the Ensembl version string:

ENST00000381657_ENSG00000182378.15_PAR_Y_hsa_chrY_299096_303356_+_utr_299335_303356(+)
ENST00000432318_ENSG00000198223.17_PAR_Y,ENST00000494969_ENSG00000198223.17_PAR_Y,ENST00000355432_ENSG00000198223.17_PAR_Y,ENST00000381529_ENSG00000198223.17_PAR_Y_hsa_chrY_1309401_1309921_+_utr_1309868_1309921(+)
ENST00000331035_ENSG00000185291.12_PAR_Y_hsa_chrY_1382390_1382685_+_utr_1382465_1382685(+)
ENST00000313871_ENSG00000197976.12_PAR_Y_hsa_chrY_1600658_1602514_+_utr_1601594_1602514(+)
ENST00000262640_ENSG00000124333.16_PAR_Y,ENST00000286448_ENSG00000124333.16_PAR_Y_hsa_chrY_57128402_57130289_+_utr_57128659_57130289(+)
ENST00000381401_ENSG00000169100.14_PAR_Y_hsa_chrY_1386151_1386759_-_utr_1386151_1386601(-)

This was unexpected, and the extra underscores (e.g. .15_PAR_Y) caused QAPA's string parsing to fail. To get around this, you should use Ensembl Gene IDs without version numbers, which is what QAPA expects.
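To illustrate why the suffix matters, here is a minimal sketch that simply splits an ID on underscores. It shows the general shape of the problem and is not QAPA's actual parsing code. The helper `count_tokens` is hypothetical, and the second ID is the first example with the version and _PAR_Y suffix removed by hand.

```python
def count_tokens(seq_id):
    """Count the underscore-delimited tokens in a sequence ID."""
    return len(seq_id.split("_"))


with_suffix = (
    "ENST00000381657_ENSG00000182378.15_PAR_Y_hsa_chrY_"
    "299096_303356_+_utr_299335_303356(+)"
)
# Hand-edited for illustration: version and _PAR_Y suffix stripped.
without_suffix = (
    "ENST00000381657_ENSG00000182378_hsa_chrY_"
    "299096_303356_+_utr_299335_303356(+)"
)

# The _PAR_Y suffix contributes two extra tokens ("PAR" and "Y").
extra = count_tokens(with_suffix) - count_tokens(without_suffix)
print(extra)  # 2
```

Two extra tokens would be consistent with the error reported in this thread, where 11 items were supplied for 9 columns.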

In the meantime, as a quick workaround, I suggest removing these entries from your quant files entirely. For example:

grep -v "_PAR_" quant.sf > quant2.sf

Then try qapa quant again.
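If you also want to know how many entries were removed, the same filter can be sketched in Python. The function name `drop_par_entries` is hypothetical and not part of QAPA; like the grep command, it drops any line containing the marker string and leaves every other line untouched.

```python
def drop_par_entries(src, dst, marker="_PAR_"):
    """Copy `src` to `dst`, skipping lines that contain `marker`.

    Returns the number of lines that were dropped.
    """
    dropped = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if marker in line:
                dropped += 1  # same lines that grep -v would exclude
            else:
                fout.write(line)
    return dropped
```

For example, `drop_par_entries("quant.sf", "quant2.sf")` writes the filtered file and returns the count of removed lines.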

In summary: the issue is not due to data.table versions, but rather to the unexpected inclusion of underscores in version IDs. QAPA expects Ensembl Gene IDs without the version number.