hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0

Cobalt encoding issue? #93

Closed alhafidzhamdan closed 4 years ago

alhafidzhamdan commented 4 years ago

Hi there, I'm trying to run the COBALT standalone program and got this error:

06:46:34 - COBALT version: 1.8
06:46:34 - Using non default value 16 for parameter threads
06:46:34 - Thread Count: 16, Window Size: 1000, Min Quality 10
06:46:34 - Reading GC Profile
06:46:34 - Complete
Exception in thread "main" java.io.UncheckedIOException: java.nio.charset.MalformedInputException: Input length = 1
    at java.io.BufferedReader$1.hasNext(BufferedReader.java:574)
    at java.util.Iterator.forEachRemaining(Iterator.java:115)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at com.hartwig.hmftools.common.utils.io.reader.LineReader.read(LineReader.java:38)
    at com.hartwig.hmftools.common.utils.io.reader.LineReader.lambda$build$0(LineReader.java:25)
    at com.hartwig.hmftools.common.genome.gc.GCProfileFactory.loadGCContent(GCProfileFactory.java:29)
    at com.hartwig.hmftools.cobalt.CountBamLinesApplication.run(CountBamLinesApplication.java:86)
    at com.hartwig.hmftools.cobalt.CountBamLinesApplication.main(CountBamLinesApplication.java:46)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.readLine(BufferedReader.java:324)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at java.io.BufferedReader$1.hasNext(BufferedReader.java:571)
    ... 12 more

I think it's an encoding issue, but which encoding do you expect for the input files? Thank you.

jonbaber commented 4 years ago

Hi, I have not encountered this problem before but it looks like an issue with your GCProfile file.

Can you please do a "head" of the file? It should look something like this:

$ head GC_profile.1000bp.cnp
1   0   -1  0   0
1   1000    -1  0   0
1   2000    -1  0   0
1   3000    -1  0   0
1   4000    -1  0   0
1   5000    -1  0   0
1   6000    -1  0   0
1   7000    -1  0   0
1   8000    -1  0   0
1   9000    -1  0   0

alhafidzhamdan commented 4 years ago

Hi @jonbaber thanks for getting back:

head GC_profile.hg38.1000bp.cnp.gz

chr1 0 -1 0 0
chr1 1000 -1 0 0
chr1 2000 -1 0 0
chr1 3000 -1 0 0
chr1 4000 -1 0 0
chr1 5000 -1 0 0
chr1 6000 -1 0 0
chr1 7000 -1 0 0
chr1 8000 -1 0 0
chr1 9000 -1 0 0

That's the file I downloaded directly from https://nextcloud.hartwigmedicalfoundation.nl/s/LTiKTd8XxBqwaiC?path=%2FHMFTools-Resources%2FCobalt

I reckon now it's probably to do with chromosome column naming then?

jonbaber commented 4 years ago

You are using the hg38 version of the GCProfile so I presume the bam you are trying to analyse is also on hg38. Could you please confirm that?

The file you have linked to is gzipped (.gz). Have you unzipped it before supplying it to COBALT?

alhafidzhamdan commented 4 years ago

Yes, hg38. I have now unzipped the file and it's working (without removing "chr" from the first column). FYI for anybody else encountering the same issue. Thanks! A
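
For anyone landing here with the same stack trace: it is consistent with compressed bytes being decoded as text. A gzip stream always starts with the bytes 0x1f 0x8b, and 0x8b is not a valid first byte of a UTF-8 character, so a strict decoder (presumably UTF-8 in this code path) fails almost immediately with "Input length = 1". The sketch below only illustrates that mechanism under those assumptions. The class and method names are hypothetical and none of it is hmftools code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; none of this is hmftools code.
final class GzipAsTextDemo {

    // Compress a string in memory, standing in for a downloaded .gz file.
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decompress gzip bytes back into text.
    static String gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Strict UTF-8 decoding: a new decoder reports malformed input
    // (MalformedInputException is a CharacterCodingException) instead of replacing it.
    static boolean decodesAsUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] compressed = gzip("chr1\t0\t-1\t0\t0\n");
        System.out.println("raw gzip bytes decode as UTF-8: " + decodesAsUtf8(compressed));
        System.out.println("after gunzip: " + gunzip(compressed).trim());
    }
}
```

In practice the fix is the one described above: decompress the downloaded file (for example with gunzip) and pass the uncompressed .cnp file to COBALT.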

jonbaber commented 4 years ago

Glad you got it working. I will add some validation to the input to alert the user to unzip the file if they supply a ".gz" file.
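
Purely as an illustration of what such a check might look like, and not the validation actually added to COBALT, a file-name test could fail fast with a readable message before any parsing starts. The class name, method name, and message text here are all hypothetical.

```java
import java.util.Locale;

// Hypothetical sketch; this is not the validation shipped in hmftools.
final class GcProfileInputCheck {

    private GcProfileInputCheck() {
    }

    // Reject paths that end in ".gz" so the user gets a clear hint
    // rather than a MalformedInputException from the line reader.
    static void requireUncompressed(String gcProfilePath) {
        if (gcProfilePath.toLowerCase(Locale.ROOT).endsWith(".gz")) {
            throw new IllegalArgumentException(
                    "GC profile '" + gcProfilePath + "' appears to be gzipped; "
                            + "please unzip it before supplying it to COBALT");
        }
    }

    public static void main(String[] args) {
        requireUncompressed("GC_profile.1000bp.cnp"); // passes silently
        try {
            requireUncompressed("GC_profile.hg38.1000bp.cnp.gz");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A name-based check is cheap but would miss a compressed file that has been renamed; inspecting the first two bytes for the gzip magic number (0x1f 0x8b) would be a stricter alternative.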

jonbaber commented 4 years ago

I have also updated the README to mention that the file needs to be unzipped.