mahmoudparsian / data-algorithms-book

MapReduce, Spark, Java, and Scala for Data Algorithms Book
http://mapreduce4hackers.com

FastaRecordReader for huge fasta files #31

Open jmabuin opened 5 years ago

jmabuin commented 5 years ago

Hi,

I have a question about the FastaRecordReader class (data-algorithms-book/src/main/java/org/dataalgorithms/chap24/mapreduce/FastaRecordReader.java).

I have been trying to use it for large genomes (FASTA files much larger than an HDFS block, e.g. ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.38_GRCh38.p12/GCF_000001405.38_GRCh38.p12_genomic.fna.gz), but I am getting wrong sequences.

Is it possible that using these classes from Spark with the newAPIHadoopFile method does not work for very large files? Or am I missing something?
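To make the concern concrete, here is a minimal sketch in plain Java (no Hadoop or Spark) of the split-boundary rule a FASTA reader would need once a file spans several blocks. It is not the FastaRecordReader from the repository, and whether split handling is the actual cause of the wrong sequences is only an assumption. The class name FastaSplitSketch and the methods readSplit, readAllSplits, and isHeaderStart are hypothetical names made up for this illustration. The rule sketched: a split owns every record whose '>' header starts inside its byte range, and it reads such a record to the end even when that runs past the split boundary.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the repository's FastaRecordReader.
final class FastaSplitSketch {

    // A '>' starts a record only at the beginning of the data or right after a newline.
    static boolean isHeaderStart(byte[] data, int i) {
        return data[i] == '>' && (i == 0 || data[i - 1] == '\n');
    }

    // Returns every record whose header starts inside [start, end).
    // A record may extend past `end`; it is read up to the next header or end of data.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Skip the tail of a record that began in an earlier split.
        while (pos < data.length && pos < end && !isHeaderStart(data, pos)) {
            pos++;
        }
        // From here on, pos always sits on a header start (or is past the split).
        while (pos < data.length && pos < end) {
            int next = pos + 1;
            while (next < data.length && !isHeaderStart(data, next)) {
                next++;
            }
            records.add(new String(data, pos, next - pos, StandardCharsets.US_ASCII));
            pos = next;
        }
        return records;
    }

    // Simulates reading the data as consecutive fixed-size splits.
    static List<String> readAllSplits(byte[] data, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            all.addAll(readSplit(data, start, end));
        }
        return all;
    }

    public static void main(String[] args) {
        String fasta = ">seq1\nACGT\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\nTTAA\n";
        byte[] data = fasta.getBytes(StandardCharsets.US_ASCII);
        List<String> whole = readSplit(data, 0, data.length);
        // Every split size must yield the same records as a single pass.
        for (int size = 1; size <= data.length; size++) {
            if (!whole.equals(readAllSplits(data, size))) {
                throw new AssertionError("records differ for split size " + size);
            }
        }
        System.out.println(whole.size() + " records, same result for every split size");
    }
}
```

The same comparison (records from one pass over the whole file versus records collected split by split) could serve as a check on a small test file before trying the full genome.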

Regards, and thank you very much for your time.

Jose M. Abuin

mahmoudparsian commented 5 years ago

Hello Jose,

I will look into this and test it with your input.

Thanks, best regards,

Mahmoud