sagnikbanerjee15 / Finder

A fully automated gene annotator from RNA-Seq expression data
MIT License

Genome examples empty? #47

Open paul-bio opened 2 years ago

paul-bio commented 2 years ago

Hi, I am trying to predict gene structure using Finder. It seems this tool performs better than PASA, MAKER, and others, so I am planning to get used to it.

However, in the example data you shared I could find the metadata, protein data, and raw data, but not the genome data. Where can I find the genome sequence?

Thanks a lot for letting us use this beautiful tool. Sincerely, Paul.

sagnikbanerjee15 commented 2 years ago

Hello @paul-bio,

Thank you so much for your interest in finder. You have to download the genome file from this link -> http://ftp.ensemblgenomes.org/pub/plants/release-52/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.gz. Currently, I am a bit busy with my work. I will make changes to the README file later. Also, please make sure you enter the correct location of the directory for the dummy data in metadata.csv file. It should be the location where you performed the git pull.
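The link above points to a gzip-compressed FASTA (.fa.gz), while the commands in this thread pass a path ending in .fa, so the download presumably needs to be decompressed first. A minimal sketch, assuming wget and gzip are installed; `decompress_genome` is a hypothetical helper name used only for illustration and is not part of finder:

```shell
# Hypothetical helper (not part of finder): decompress a .gz file next to
# the original and keep the compressed copy.
decompress_genome() {
    gz_file=$1
    out_file=${gz_file%.gz}          # strip the trailing .gz
    gzip -dc "$gz_file" > "$out_file"
}

# Example usage (downloads from the URL given above; uncomment to run):
# wget http://ftp.ensemblgenomes.org/pub/plants/release-52/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.gz
# decompress_genome Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.gz
```

Writing the output with `gzip -dc ... > file` rather than decompressing in place keeps the original archive around in case the run has to be repeated.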

Please let us know if you encounter any issues while running the software.

Thank you

paul-bio commented 2 years ago

Thanks. I downloaded the genome data from the link you mentioned.

But then I got the error message below:

cp: cannot stat ‘Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa’: No such file or directory
[E::fai_build3_core] Failed to open the file /NFS2/users/creo9447/finder_tutorial/1.raw_data/FINDER/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa
[faidx] Could not build fai index /NFS2/users/creo9447/finder_tutorial/1.raw_data/FINDER/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.fai
Traceback (most recent call last):
  File "/NFS2/users/creo9447/software/Finder/Finder-finder_v1.1.0/finder", line 688, in <module>
    main()
  File "/NFS2/users/creo9447/software/Finder/Finder-finder_v1.1.0/finder", line 628, in main
    validateCommandLineArguments( options, logger_proxy, logging_mutex )
  File "/NFS2/users/creo9447/software/Finder/Finder-finder_v1.1.0/finder", line 244, in validateCommandLineArguments
    fhr = open( options.genome, "r" )
FileNotFoundError: [Errno 2] No such file or directory: '/NFS2/users/creo9447/finder_tutorial/1.raw_data/FINDER/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa'

What should I do?

++ Also, in the Finder/example/raw_data/ path there are two FASTQ files (dummy_data1.fastq and dummy_data2.fastq.gz). dummy_data1.fastq looks normal, but dummy_data2.fastq.gz doesn't seem to contain FASTQ content.

++ And this is the command I used:

$ finder -mf metadata.xlsx \
    --output_dir ./FINDER \
    --genome Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa \
    --organism_model PLANTS \
    --genemark_path /software/Finder/geneMark/gmes_linux_64/ \
    --genemark_license /software/Finder/GeneMark/gm_key_64 \
    --cpu 32 --no_cleanup \
    --protein uniprot_ARATH.fasta

sagnikbanerjee15 commented 2 years ago

Hello Paul,

Thanks for posting the error message. There are a few tweaks you need to make to the command that you are running. Firstly, it should be metadata.csv, not metadata.xlsx. Secondly, could you please make sure that the genome is in fact present in the current directory? I always provide the whole path (from /) to a program. That way I can execute it from anywhere and not really worry about changing anything. It seems that finder cannot locate the genome file. Thirdly, the command you should execute is run_finder and not finder. run_finder will check your system for the presence of docker or singularity. Depending on which one is available it will execute finder.
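The absolute-path advice can be sketched as a small pre-flight check that runs before the pipeline is launched. This is only an illustration: `abs_path_or_fail` is a hypothetical helper name, not something shipped with finder, and it assumes a POSIX shell.

```shell
# Hypothetical helper (not part of finder): print the absolute path of an
# existing file, or fail with a message on stderr if it is missing.
abs_path_or_fail() {
    if [ ! -f "$1" ]; then
        echo "missing file: $1" >&2
        return 1
    fi
    # Resolve the directory with cd/pwd so this works without realpath.
    dir=$(cd "$(dirname "$1")" && pwd)
    echo "$dir/$(basename "$1")"
}

# Example usage:
# genome=$(abs_path_or_fail Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa) || exit 1
```

Failing early on a missing file gives a clearer message than the FileNotFoundError raised deep inside the pipeline.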

dummy_data2.fastq.gz is a gzip-compressed FASTQ file. If you open it directly you will see garbled text, which is expected because the data is in a binary format that is not human-readable. I included it in the example run to demonstrate that finder can work with compressed RNA-Seq samples as well.
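To look inside a compressed FASTQ without decompressing it to disk, something along these lines should work, assuming gzip is installed. `peek_fastq_gz` is a hypothetical helper name, not part of finder:

```shell
# Hypothetical helper (not part of finder): print the first FASTQ record
# (4 lines) of a gzip-compressed file after checking its integrity.
peek_fastq_gz() {
    gzip -t "$1" || return 1         # non-zero exit if the file is not valid gzip
    gzip -dc "$1" | head -n 4
}

# Example usage:
# peek_fastq_gz dummy_data2.fastq.gz
```

If the file is intact, the first printed line should start with `@`, as in any FASTQ record.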

Please let me know if you run into any further problems.

Thank you.

paul-bio commented 2 years ago

Thank you for your help. As you mentioned, typing the full path seems to work now. I also changed the metadata to CSV format. But I got this error message:

cat: /NFS2/users/creo9447/finder_tutorial/FINDER/alignments/dummy_round1_SJ.out.tab: No such file or directory
cat: /NFS2/users/creo9447/finder_tutorial/FINDER/alignments/dummy_round2_SJ.out.tab: No such file or directory
mv: cannot stat ‘/NFS2/users/creo9447/finder_tutorial/FINDER/alignments/dummy_final_Unmapped.out.mate1’: No such file or directory
mv: cannot stat ‘/NFS2/users/creo9447/finder_tutorial/FINDER/alignments/dummy_final_Log.final.out’: No such file or directory
cat: /NFS2/users/creo9447/finder_tutorial/FINDER/alignments/dummy_round3_SJ.out.tab: No such file or directory
samtools index: "/NFS2/users/creo9447/finder_tutorial/FINDER/alignments/dummy_final.sortedByCoord.out.bam" is in a format that cannot be usefully indexed
samtools index: "/NFS2/users/creo9447/finder_tutorial/FINDER/alignments/dummy_final.sortedByCoord.out.bam" is in a format that cannot be usefully indexed
sh: junc: command not found
sh: subexon-info: command not found
[main_samview] fail to read the header from "/NFS2/users/creo9447/finder_tutorial/FINDER/alignments/dummy_final.sortedByCoord.out.bam".
[main_samview] fail to read the header from "/NFS2/users/creo9447/finder_tutorial/FINDER/alignments/dummy_for_psiclass.sam".
mv: cannot stat ‘/NFS2/users/creo9447/finder_tutorial/FINDER/assemblies_psiclass_modified/combined/psiclass_output_sample_0.gtf’: No such file or directory
mv: cannot stat ‘/NFS2/users/creo9447/finder_tutorial/FINDER/assemblies_psiclass_modified/combined/psiclass_output_vote.gtf’: No such file or directory
Traceback (most recent call last):
  File "/NFS2/users/creo9447/software/Finder/Finder-finder_v1.1.0/finder", line 688, in <module>
    main()
  File "/NFS2/users/creo9447/software/Finder/Finder-finder_v1.1.0/finder", line 649, in main
    orchestrateGeneModelPrediction( options, logger_proxy, logging_mutex )
  File "/NFS2/users/creo9447/software/Finder/Finder-finder_v1.1.0/finder", line 461, in orchestrateGeneModelPrediction
    findTranscriptsInEachSampleNotReportedInCombinedAnnotations( options, logger_proxy, logging_mutex )
  File "/NFS2/users/creo9447/software/Finder/Finder-finder_v1.1.0/scripts/findTranscriptsInEachSampleNotReportedInCombinedAnnotations.py", line 17, in findTranscriptsInEachSampleNotReportedInCombinedAnnotations
    combined_transcript_info = readAllTranscriptsFromGTFFileInParallel( [combined_gtf_filename, "combined", "combined"] )[0]
  File "/NFS2/users/creo9447/software/Finder/Finder-finder_v1.1.0/scripts/fileReadWriteOperations.py", line 290, in readAllTranscriptsFromGTFFileInParallel
    fhr = open( gtf_filename, "r" )
FileNotFoundError: [Errno 2] No such file or directory: '/NFS2/users/creo9447/finder_tutorial/FINDER/assemblies_psiclass_modified/combined/combined.gtf'

The protein, genome, metadata, and FASTQ files are all in the same directory (finder_tutorial). Since I am running FINDER on a local Linux machine, I couldn't use Docker. This is the metadata file I used: metadata.CSV

Can you suggest anything? Thank you :)

sagnikbanerjee15 commented 2 years ago

Hello @paul-bio,

Thank you for posting this. Could you please confirm that you have either singularity or docker installed on your system? Please paste the outputs of which docker and which singularity. Are you running this on your personal computer or on a computational cluster?
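For reference, the check being asked for can be sketched as below. This is only an illustration of looking up the two runtimes on PATH; it is not run_finder's own logic, and `detect_framework` is a hypothetical name:

```shell
# Hypothetical helper: report which container runtime is on PATH.
# Only a sketch of the check described above, not run_finder's own code.
detect_framework() {
    if command -v docker >/dev/null 2>&1; then
        echo docker
    elif command -v singularity >/dev/null 2>&1; then
        echo singularity
    else
        echo none
    fi
}

# Example usage:
# detect_framework
```

`command -v` is used instead of `which` because it is built into POSIX shells and behaves the same across systems.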

Thank you.

paul-bio commented 2 years ago

I am currently running on a computational cluster. I am used to working in a conda environment, but there was none.

First I downloaded FINDER with wget as below:

$ wget https://github.com/sagnikbanerjee15/Finder/archive/refs/tags/finder_v1.1.0.tar.gz

Since run_finder works with Docker, I used finder instead.

And I got the GeneMark key and files from the website.

Thank you.

sagnikbanerjee15 commented 2 years ago

Actually, the docker image has all the software preinstalled. finder will not work in a conda environment since that setup is not maintained anymore.

Hope that helps!

Thank you.

paul-bio commented 2 years ago

Hello again. I installed Docker and got an error message.

This is the command:

$ run_finder -mf /NFS2/users/creo9447/software/FINDER/example/Arabidopsis_thaliana_metadata.csv \
    -out_dir /NFS2/users/creo9447/software/FINDER/example/FINDER \
    -g /NFS2/users/creo9447/software/FINDER/example/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa \
    --protein /NFS2/users/creo9447/software/FINDER/example/uniprot_ARATH.fasta \
    -om PLANTS \
    --genemark_path /NFS2/users/creo9447/software/GeneMark/gmes_linux_64/ \
    --genemark_license /NFS2/users/creo9447/software/GeneMark/gm_key_64 \
    --cpu 32 --framework docker

And this is the log:

Trying to pull repository docker.io/sagnikbanerjee15/finder ...
1.1.0: Pulling from docker.io/sagnikbanerjee15/finder
Digest: sha256:9816d258d2421d4625983c929f508b1f577cfe7ab3bc2042e841647a186c7931
Status: Image is up to date for docker.io/sagnikbanerjee15/finder:1.1.0
done
cat: /NFS2/users/creo9447/software/FINDER/example/FINDER/alignments/dummy_data1_round1_SJ.out.tab: No such file or directory
cat: /NFS2/users/creo9447/software/FINDER/example/FINDER/alignments/dummy_data1_round2_SJ.out.tab: No such file or directory
mv: cannot stat '/NFS2/users/creo9447/software/FINDER/example/FINDER/alignments/dummy_data1_final_Unmapped.out.mate1': No such file or directory
mv: cannot stat '/NFS2/users/creo9447/software/FINDER/example/FINDER/alignments/dummy_data1_final_Log.final.out': No such file or directory
cat: /NFS2/users/creo9447/software/FINDER/example/FINDER/alignments/dummy_data1_round3_SJ.out.tab: No such file or directory
samtools index: "/NFS2/users/creo9447/software/FINDER/example/FINDER/alignments/dummy_data1_final.sortedByCoord.out.bam" is in a format that cannot be usefully indexed
samtools index: "/NFS2/users/creo9447/software/FINDER/example/FINDER/alignments/dummy_data1_final.sortedByCoord.out.bam" is in a format that cannot be usefully indexed
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
Can not open /NFS2/users/creo9447/software/FINDER/example/FINDER/alignments/dummy_data1_final.sortedByCoord.out.bam.
[main_samview] fail to read the header from "/NFS2/users/creo9447/software/FINDER/example/FINDER/alignments/dummy_data1_final.sortedByCoord.out.bam".
[main_samview] fail to read the header from "/NFS2/users/creo9447/software/FINDER/example/FINDER/alignments/dummy_data1_for_psiclass.sam".
mv: cannot stat '/NFS2/users/creo9447/software/FINDER/example/FINDER/assemblies_psiclass_modified/combined/psiclass_output_sample_0.gtf': No such file or directory
mv: cannot stat '/NFS2/users/creo9447/software/FINDER/example/FINDER/assemblies_psiclass_modified/combined/psiclass_output_vote.gtf': No such file or directory
Traceback (most recent call last):
  File "/softwares/FINDER/Finder/finder", line 688, in <module>
    main()
  File "/softwares/FINDER/Finder/finder", line 649, in main
    orchestrateGeneModelPrediction( options, logger_proxy, logging_mutex )
  File "/softwares/FINDER/Finder/finder", line 461, in orchestrateGeneModelPrediction
    findTranscriptsInEachSampleNotReportedInCombinedAnnotations( options, logger_proxy, logging_mutex )
  File "/softwares/FINDER/Finder/scripts/findTranscriptsInEachSampleNotReportedInCombinedAnnotations.py", line 17, in findTranscriptsInEachSampleNotReportedInCombinedAnnotations
    combined_transcript_info = readAllTranscriptsFromGTFFileInParallel( [combined_gtf_filename, "combined", "combined"] )[0]
  File "/softwares/FINDER/Finder/scripts/fileReadWriteOperations.py", line 290, in readAllTranscriptsFromGTFFileInParallel
    fhr = open( gtf_filename, "r" )
FileNotFoundError: [Errno 2] No such file or directory: '/NFS2/users/creo9447/software/FINDER/example/FINDER/assemblies_psiclass_modified/combined/combined.gtf'

What might be the problem? Sorry to keep bothering you. Thanks anyway.

sagnikbanerjee15 commented 2 years ago

Hello @paul-bio,

Could you please try again after removing the output directory? Also, please send me the contents of the file /NFS2/users/creo9447/software/FINDER/example/Arabidopsis_thaliana_metadata.csv

Thank you.

paul-bio commented 2 years ago

Sorry to keep bothering you, sagnikbanerjee15.

Here is the metadata file I used:

Arabidopsis_thaliana_metadata.csv

When I removed the output directory, a shorter error message appeared:

Trying to pull repository docker.io/sagnikbanerjee15/finder ...
1.1.0: Pulling from docker.io/sagnikbanerjee15/finder
Digest: sha256:9816d258d2421d4625983c929f508b1f577cfe7ab3bc2042e841647a186c7931
Status: Image is up to date for docker.io/sagnikbanerjee15/finder:1.1.0
done
rm: cannot remove '/NFS2/users/creo9447/software/FINDER/example/FINDER/assemblies_psiclassmodified/combined/outputfileforCPD*': No such file or directory
Traceback (most recent call last):
  File "/softwares/FINDER/Finder/finder", line 688, in <module>
    main()
  File "/softwares/FINDER/Finder/finder", line 673, in main
    addBRAKERPredictions( options, logger_proxy, logging_mutex )
  File "/softwares/FINDER/Finder/scripts/predictGenesUsingBRAKER.py", line 287, in addBRAKERPredictions
    fhr = open( options.output_assemblies_psiclass_terminal_exon_length_modified + "/proteins_comparison_gffcompare.proteins_for_alignment.gtf.refmap", "r" )
FileNotFoundError: [Errno 2] No such file or directory: '/NFS2/users/creo9447/software/FINDER/example/FINDER/assemblies_psiclass_modified/proteins_comparison_gffcompare.proteins_for_alignment.gtf.refmap'

Thank you

sagnikbanerjee15 commented 2 years ago

Hello @paul-bio,

Thanks for your reply. And please do not hesitate to ask questions. Feedback like this will make finder even better.

I looked at the metadata.csv file and it seems everything is correct. Is there any reason why you are running the example with only one RNA-Seq dataset?

Could you send me the output of the command ls -lhrt /NFS2/users/creo9447/software/FINDER/example/FINDER/alignments ? And also the progress.log file.

Thank you.

paul-bio commented 2 years ago

Thanks.

I am trying to get used to this tool, and I am planning to use it in my project. As an initial step toward learning it, I am analyzing the data downloaded from this GitHub repository. In raw_data there were two files: dummy_data1.fastq and dummy_data2.fastq.gz. Since I had no information about how the data were sequenced, I first thought those two files might be two different single-end files, so I only used dummy_data1.fastq. Are they paired-end files?

Anyway, here are the listing of the alignments directory and the progress.log file. progress.log

$ ls -lhrt "/NFS2/users/creo9447/software/FINDER/example/FINDER/alignments/"
total 440K
-rw-r--r--. 1 root root    0 Feb 11 22:08 dummy_data1_round1.error
-rw-r--r--. 1 root root  772 Feb 11 22:08 dummy_data1_round1_SJ.out.tab
-rw-r--r--. 1 root root 2.0K Feb 11 22:08 dummy_data1_round1_Log.final.out
-rw-r--r--. 1 root root 1.2K Feb 11 22:08 dummy_data1_round1.output
-rw-r--r--. 1 root root    0 Feb 11 22:08 leaf_round1_SJ.out.tab
-rw-r--r--. 1 root root    0 Feb 11 22:08 dummy_data1_round2.error
-rw-r--r--. 1 root root    0 Feb 11 22:08 dummy_data1_round2_SJ.out.tab
-rw-r--r--. 1 root root 2.0K Feb 11 22:08 dummy_data1_round2_Log.final.out
-rw-r--r--. 1 root root 1.4K Feb 11 22:08 dummy_data1_round2.output
-rw-r--r--. 1 root root    0 Feb 11 22:08 leaf_round2_SJ.out.tab
-rw-r--r--. 1 root root    0 Feb 11 22:08 leaf_round1_and_round2_SJ.out.tab
-rw-r--r--. 1 root root    0 Feb 11 22:08 dummy_data1_round3.error
-rw-r--r--. 1 root root    0 Feb 11 22:08 dummy_data1_round3_SJ.out.tab
-rw-r--r--. 1 root root 2.0K Feb 11 22:09 dummy_data1_round3_Log.final.out
-rw-r--r--. 1 root root 1.2K Feb 11 22:09 dummy_data1_round3.output
-rw-r--r--. 1 root root    0 Feb 11 22:09 leaf_round3_SJ.out.tab
-rw-r--r--. 1 root root    0 Feb 11 22:09 leaf_round1_and_round2_and_round3_SJ.out.tab
-rw-r--r--. 1 root root  261 Feb 11 22:09 dummy_data1_round5.error
-rw-r--r--. 1 root root  21K Feb 11 22:09 dummy_data1_final.sortedByCoord.out.bam
-rw-r--r--. 1 root root  56K Feb 11 22:09 dummy_data1_final.sortedByCoord.out.bam.bai
-rw-r--r--. 1 root root  873 Feb 11 22:09 dummy_data1_final.sortedByCoord.out.bam.csi
-rw-r--r--. 1 root root  748 Feb 11 22:09 dummy_data1_introns
-rw-r--r--. 1 root root  460 Feb 11 22:09 dummy_data1_introns.bed
-rw-r--r--. 1 root root 3.5K Feb 11 22:09 dummy_data1_exons
-rw-r--r--. 1 root root 1.1K Feb 11 22:09 dummy_data1_exons.bed
-rw-r--r--. 1 root root  595 Feb 11 22:09 dummy_data1_num_exons_in_intron
-rw-r--r--. 1 root root  94K Feb 11 22:09 dummy_data1_final.sortedByCoord.out.sam
-rw-r--r--. 1 root root  94K Feb 11 22:09 dummy_data1_for_psiclass.sam
-rw-r--r--. 1 root root  21K Feb 11 22:09 dummy_data1_for_psiclass.bam
-rw-r--r--. 1 root root  257 Feb 11 22:09 mapping_stats.csv
-rw-r--r--. 1 root root    0 Feb 11 22:09 dummy_data1_counts.output
-rw-r--r--. 1 root root    0 Feb 11 22:09 dummy_data1_counts.error
-rw-r--r--. 1 root root 8.1K Feb 11 22:09 dummy_data1_counts_genome_cov.bed
-rw-r--r--. 1 root root   20 Feb 11 22:09 dummy_data1_counts_all_info.pkl
-rw-r--r--. 1 root root  56K Feb 11 22:09 dummy_data1_for_psiclass.bam.bai
-rw-r--r--. 1 root root  874 Feb 11 22:09 dummy_data1_for_psiclass.bam.csi
-rw-r--r--. 1 root root    0 Feb 11 22:09 dummy_data1_SJ_regtools.bed.output
-rw-r--r--. 1 root root  341 Feb 11 22:09 dummy_data1_SJ_regtools.bed.error
-rw-r--r--. 1 root root 1.8K Feb 11 22:09 dummy_data1_SJ_regtools.bed

Thanks.

sagnikbanerjee15 commented 2 years ago

Hello @paul-bio,

Thank you for deciding to use finder in your project. I would recommend that you use the entire metadata.csv file and not just the dummy data. The dummy data contain very few reads which are not enough to generate any annotations. The reason for including those data is to ensure that the pipeline can process locally available data. Don't worry about not having the rest of the data. finder will automatically download those from NCBI SRA. This is the command you should try:

# Remove the output directory
rm -rf /NFS2/users/creo9447/software/FINDER/example/FINDER

# Run the program with the entire metadata
run_finder -mf /NFS2/users/creo9447/software/FINDER/example/Arabidopsis_thaliana_metadata.csv \
    -out_dir /NFS2/users/creo9447/software/FINDER/example/FINDER \
    -g /NFS2/users/creo9447/software/FINDER/example/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa \
    --protein /NFS2/users/creo9447/software/FINDER/example/uniprot_ARATH.fasta \
    -om PLANTS \
    --genemark_path /NFS2/users/creo9447/software/GeneMark/gmes_linux_64/ \
    --genemark_license /NFS2/users/creo9447/software/GeneMark/gm_key_64 \
    --cpu 32 --framework docker
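To see for yourself how few reads the dummy files contain, a rough count can be done as below. This is a sketch that assumes the usual four-lines-per-record FASTQ layout; `count_fastq_reads` is a hypothetical helper name, not part of finder:

```shell
# Hypothetical helper (not part of finder): count reads in a FASTQ file,
# assuming 4 lines per record; .gz files are decompressed on the fly.
count_fastq_reads() {
    case "$1" in
        *.gz) lines=$(gzip -dc "$1" | wc -l) ;;
        *)    lines=$(wc -l < "$1") ;;
    esac
    echo $((lines / 4))
}

# Example usage:
# count_fastq_reads dummy_data1.fastq
# count_fastq_reads dummy_data2.fastq.gz
```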

Please let me know if this works.

Thank you.

paul-bio commented 2 years ago

Hi again. With the command you suggested, I still got an error message. It seems there is a problem with .gm_key when running BRAKER. Here are braker.error.txt, progress.log, and the error message I got. braker.error.txt progress.log

Trying to pull repository docker.io/sagnikbanerjee15/finder ...
1.1.0: Pulling from docker.io/sagnikbanerjee15/finder
Digest: sha256:9816d258d2421d4625983c929f508b1f577cfe7ab3bc2042e841647a186c7931
Status: Image is up to date for docker.io/sagnikbanerjee15/finder:1.1.0
done
rm: cannot remove '/NFS2/users/creo9447/software/FINDER/example/FINDER/assemblies_psiclassmodified/combined/outputfileforCPD*': No such file or directory
Traceback (most recent call last):
  File "/softwares/FINDER/Finder/finder", line 688, in <module>
    main()
  File "/softwares/FINDER/Finder/finder", line 673, in main
    addBRAKERPredictions( options, logger_proxy, logging_mutex )
  File "/softwares/FINDER/Finder/scripts/predictGenesUsingBRAKER.py", line 287, in addBRAKERPredictions
    fhr = open( options.output_assemblies_psiclass_terminal_exon_length_modified + "/proteins_comparison_gffcompare.proteins_for_alignment.gtf.refmap", "r" )
FileNotFoundError: [Errno 2] No such file or directory: '/NFS2/users/creo9447/software/FINDER/example/FINDER/assemblies_psiclass_modified/proteins_comparison_gffcompare.proteins_for_alignment.gtf.refmap'

I also noticed that your GitHub page says to get GeneMark-ES/ET/EP version 4.62. However, the website currently offers a newer version, 4.69. Is it possible that the error keeps appearing because of the different version?

sagnikbanerjee15 commented 2 years ago

Hello @paul-bio,

Thanks for sending me the error files. I checked the progress.log and it seems like you did not use the metadata.csv file from the GitHub repo. The file you used contains only the dummy data. Please rerun the program with the original metadata file. I don't think the version of GeneMark-ES/ET/EP would matter in this case.

Thank you.

paul-bio commented 2 years ago

Hi @sagnikbanerjee15,

I changed my metadata.csv, and this time I got a lot more errors than in the previous runs. error_message.txt

I used this metadata file. I don't think there is any problem with the metadata or the raw data now. Arabidopsis_thaliana_metadata.csv

Thanks.

sagnikbanerjee15 commented 2 years ago

Hello @paul-bio,

Thanks for posting the error. The command looks good and so does the metadata file. I will need some time to figure out the problem. I will let you know when I am done.

Thank you.

paul-bio commented 2 years ago

Thanks, @sagnikbanerjee15. I hope the problem gets fixed soon.

Thanks a lot.