Error with FuSeq.R - Githubissues

MingL196 commented 5 years ago

Hi,

I've been trying to run FuSeq. I've successfully run the C++ portion of the program, and now I am trying to run the FuSeq.R.

I'm getting the error: replacement has 1 row, data has 0

After looking through the R code, I tracked this down to line 55 in postProcessSplitRead.R where the myFusionTmp2 had no rows, making the code myFusionTmp2$mappedCount=0 return an error. Trying to fix this caused more errors of the same type further down in the program.

Should myFusionTmp2 ever be empty?

Thanks, Ming

nghiavtr commented 5 years ago

Dear Ming,

Thank you for using FuSeq.

It is normal to have no row in myFusionTmp2 since it depends on the previous filter steps. If the number of rows is zero, no split reads remain.

I have not got this error before so the current codes do not deal with the issue. If you send me your Rdata, I would revise the codes, and send it to you to make it running smoothly, thanks.

Best wishes, Nghia

MingL196 commented 5 years ago

I've uploaded the input to https://github.com/MingL196/test . The command I'm using to analyze this is:

Rscript $fuseq_pwd/R/FuSeq.R in=class2 txfasta=Homo_sapiens.GRCh37.75.cdna.all.fa sqlite=Homo_sapiens.GRCh37.75.sqlite txanno=Homo_sapiens.GRCh37.75.txAnno.RData out=R1_001_out2

Maybe it's the R version? (I'm using R 3.5.2) For example, the commands

test=data.frame(b=c(1,2),c=c(3,4))
test=test[c(),]
test$b=0

returns an error for me.

Thanks for your help!

nghiavtr commented 5 years ago

Dear Ming,

Thank you for your files.

From your command, it is likely that you run FuSeq with the default parameter settings which consider unstranded read pairs (UN). However, as I observed, your data have no information of mapped reads for the modes of "UN". Please check if you forgot to upload the data. If missing this information, it is easy to explain the errors.

Best, Nghia

MingL196 commented 5 years ago

I forgot that github has a 50 mb file size limit. Sorry about that!

I've gzipped each input file and re-uploaded them.

Thanks! Ming

nghiavtr commented 5 years ago

Dear Ming,

Thank you for your files. The error has been fixed, so please replace the file "postProcessSplitRead.R" in your R folder of FuSeq by the most updated one here: https://github.com/nghiavtr/FuSeq/blob/master/R/postProcessSplitRead.R

If you have any further issues about FuSeq, please let us know.

Thank you and good lucks to your research. Nghia

MingL196 commented 5 years ago

Thank you! I was able to run the previous dataset using the GRCh37 without errors.

I am, however, having trouble getting the data to run on GRCh38.

I used latest ensemble data: the cDNA data is from ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz and the gtf file is from ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz with the Rdata downloaded off of github https://github.com/nghiavtr/FuSeq/releases/download/v1.1.0/Homo_sapiens.GRCh38.94.txAnno.RData

I cleaned the cdna data's transcript and gene names (Note: the recommended excludeTxVersion.R only cleaned the transcript names. I used an additional command to clean the gene names).

I traced the error to processSplitRead.R line 208. For one of the miR, txExonMat is empty, causing the code if (txExonMat$TXSTRAND[1]=="+") { to return an error.

The specific txname that caused the error was "ENST00000328980". Unexpectedly, this transcript is in the cdna file, but not the gtf file, which may be why the program is returning an error.

Any idea how this can be fixed?

=================

To share the data, I saved the Rdata between lines 144 and 145 of FuSeq.R

    FuSeq.MR=processMappedRead(inPath,geneAnno=geneAnno,  anntxdb=anntxdb, geeqMap=geeqMap,FuSeq.params=FuSeq.params)
    save.image("test.rdata")
    FuSeq.SR=processSplitRead(inPath,geneAnno=geneAnno, anntxdb=anntxdb, FuSeq.params=FuSeq.params, txFastaFile=txFastaFile)

The test.rdata has been uploaded to https://github.com/MingL196/test

Thanks! Ming

nghiavtr commented 5 years ago

Dear Ming,

Yes the discordance between cdna and gtf can make the errors
However, it is strange for your case because ENST00000328980 belongs to chromosome :CHR_HSCHR6_MHC_QBL_CTG1 which is not in the list of selected chromosomes (by default with only chromosomes 1-22, X and Y) and should be filtered out at line 47 ( using chromFilter function ) of processSplitRead.R

So please check if the filter at line 47 is missing from your codes.

Best, Nghia

MingL196 commented 5 years ago

I checked. Line 47 fusionGene=chromFilter(fusionGene) is not missing in processSplitRead.R.

Strangely, the output of c++ portion of FuSeq, "splitReadInfo_1681692777.txt" paired transcript ENST00000328980 with gene ENSG00000182687 .

chromFilter filters by gene, and Gene ENSG00000182687 is located in chromosome 17.

It seems that the unusual chromosome names are causing some other bug in TxIndexer or c++ FuSeq, making the transcripts pair with the incorrect gene.

Removing all cdna with unusual chromosome names before using TxIndexer and FuSeq seems to fix this problem.

Could you confirm whether this is indeed the case?

Thanks, Ming

nghiavtr commented 5 years ago

Hi Ming,

1) The gene mapping is definitely wrong. I have checked and found that this gene "ENSG00000182687" has a transcript "ENST00000329003" in gtf file which might be the most similar name with the transcript "ENST00000328980" in cdna fasta. Thus the software basically selects the best match for it. This works well if your gtf and cdna are concordant, but unfortunately it does not work for your case.

2) "Removing all cdna with unusual chromosome names before using TxIndexer and FuSeq seems to fix this problem": Yes, sure, making the concordance between gtf and cdna is the best solution.

3) However, I fixed this discordance in function processSplitRead.R(). Simply, we just keep only transcripts appear in gtf file to avoid the error. This is done by a small script at lines 37-41 of the updated codes (https://github.com/nghiavtr/FuSeq/blob/master/R/processSplitRead.R). This error should not happen for the mapped-read pipeline because the discordant transcripts are discarded automatically in some filters.

I hope this would help to solve your issues.

Best, Nghia

nghiavtr commented 5 years ago

Hi Ming,

I have released FuSeq v1.1.2 with some fixed bugs and solve the problem of discordant transcripts between cdna and gtf. Basically discordant transcripts in the cdna will be removed. Please see the details in section 7 (https://github.com/nghiavtr/FuSeq#7-create-a-supporting-annotation-rdata-file).

Thank you for your usage and contribution to improve FuSeq.

Best, Nghia

nghiavtr / FuSeq

Error with FuSeq.R #7