some questions about the scripts in the pipeline

yozeng commented 8 months ago

Hi @rivasiker, I have two questions after studying the scripts in your pipeline.

First, I don't quite understand the purpose of LD_LIBRARY_PATH. Do I need to change the paths it specifies before running coalhmm on my own server?

Second, I have a question about the number of sequences per window returned by the start_end function. With the current value ("i - j"), it seems that coalhmm does not take the last block of each window as input, even though that block was written in fasta format by the previous target. Would it make more sense to change "i - j" to "i - j + 1"?

rivasiker commented 8 months ago

Hi @yozeng! Thank you for your questions.

Regarding the LD_LIBRARY_PATH variable, you may need to change it. Depending on how your system is configured, you might not need to set it at all.

Regarding the "i - j" to "i - j + 1" issue, you are absolutely right: it should be changed, otherwise the last block is not considered. Thank you for pointing it out! That said, the computation itself will not change. Notice that slice_lst[run][2] is only used to create a list of outputs for the gwf runs of create_fasta_and_info_tables.py and a list of inputs for the gwf runs of coalhmm. The lists themselves are not passed to the scripts as inputs; they are used exclusively to build the gwf workflow dependencies. The actual inputs for create_fasta_and_info_tables.py are slice_lst[run][0] and slice_lst[run][1], both of which are properly specified in start_end.py. As a result, the gwf workflow thinks that the last block of each run is not needed for running coalhmm, but that block is in fact generated and taken into account when coalhmm runs.
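To make the off-by-one concrete, here is a minimal sketch. It assumes, as the discussion implies, that j and i are the inclusive indices of the first and last block of a window. The function names and the example indices are hypothetical and are not taken from start_end.py.

```python
def n_blocks_current(j, i):
    """Count as currently written: one short for an inclusive range."""
    return i - j


def n_blocks_fixed(j, i):
    """Count of block indices in the inclusive range j..i."""
    return i - j + 1


# Hypothetical window covering blocks 3, 4, 5, 6 and 7.
first_block, last_block = 3, 7
blocks = list(range(first_block, last_block + 1))

print(n_blocks_current(first_block, last_block))  # 4
print(n_blocks_fixed(first_block, last_block))    # 5
print(len(blocks))                                # 5
```

Only the "+ 1" version matches the number of blocks actually present in the window.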

I hope that was clear!

yozeng commented 8 months ago

Thank you for your detailed answer! I think I understood your explanation. You mean that slice_lst[run][2] should indeed be changed to "i - j + 1", and that both slice_lst[run][0] and slice_lst[run][1] are already specified correctly! However, your first sentence says that the last block will not be considered in the current case, and your last sentence says that it will be taken into account when running coalhmm. This is a bit confusing to me.

rivasiker commented 8 months ago

Yes, exactly! Apologies for the confusing wording. In summary, slice_lst[run][2] should be changed to "i - j + 1", and both slice_lst[run][0] and slice_lst[run][1] are already specified correctly. In any case, slice_lst[run][2] is only used for specifying the workflow dependencies and not for the actual computations, so changing slice_lst[run][2] to "i - j + 1" should not have any effect on the computations themselves.

yozeng commented 8 months ago

Forgive me for belaboring the point. So the only difference between "i - j" and "i - j + 1" is whether coalhmm takes the last block of each window as input when it runs, is that right?

yozeng commented 8 months ago

I apologize for my misunderstanding! I think I have figured it out now. coalhmm internally reads all the fasta files for each window from the ../inputs/run_{}/ folder, is that right? Since the contents of the coalhmm file cannot be viewed directly on this site, I had been assuming that the outputs variable was passed to coalhmm as a whole, hence the misunderstanding. I apologize for bothering you with such a minor issue!

Also, does the LD_LIBRARY_PATH specification have anything to do with the internal implementation of coalhmm? Since I haven't looked at the specifics inside the coalhmm file yet, could you give me some advice on specifying LD_LIBRARY_PATH?

rivasiker commented 8 months ago

Yes, exactly! Since coalhmm has internal calls for the input files that do not depend on slice_lst[run][2], all of the blocks are being considered even though the gwf workflow is slightly misspecified.
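The difference between the files a workflow declares and the files a program actually finds can be sketched roughly as follows. This is only an illustration, not the actual coalhmm or gwf code: the file naming (fasta_0.fa, fasta_1.fa, ...) and both helper functions are assumptions made for the example.

```python
import glob
import os
import tempfile


def declared_files(folder, n_declared):
    """File list a workflow might declare for dependency tracking (hypothetical naming)."""
    return [os.path.join(folder, f"fasta_{k}.fa") for k in range(n_declared)]


def discovered_files(folder):
    """Files a program finds when it scans the folder itself."""
    return sorted(glob.glob(os.path.join(folder, "*.fa")))


with tempfile.TemporaryDirectory() as folder:
    n_blocks = 5
    # All five blocks are written to disk, as in the fasta-creation step.
    for k in range(n_blocks):
        with open(os.path.join(folder, f"fasta_{k}.fa"), "w") as handle:
            handle.write(f">block_{k}\nACGT\n")

    # The declared list is one short ("i - j"), but scanning the folder finds every file.
    declared_names = {os.path.basename(p) for p in declared_files(folder, n_blocks - 1)}
    found_names = [os.path.basename(p) for p in discovered_files(folder)]

undeclared = [name for name in found_names if name not in declared_names]
print(len(declared_names), len(found_names), undeclared)  # 4 5 ['fasta_4.fa']
```

The undeclared file is still on disk and still read, which is why the misspecified count affects only the dependency graph.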

I do not recall specifically why I set the LD_LIBRARY_PATH as I did, but I think it has to do with the dependencies of coalhmm. If you look at the coalhmm documentation, you need to have some Bio++ libraries installed to be able to run the program. You don't need to compile coalhmm yourself because autocoalhmm already contains a ready-to-use precompiled version of coalhmm, but you still need those libraries to run coalhmm. But again, my memory is a bit fuzzy since I did this some years ago...
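If the Bio++ shared libraries end up in a non-standard location, one way to make them visible is to prepend that directory to LD_LIBRARY_PATH for the process that launches the program. A minimal sketch, assuming a Linux system; the directory /path/to/bpp/lib and the helper function are placeholders, not paths or code from the pipeline:

```python
import os


def env_with_library_path(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("LD_LIBRARY_PATH", "")
    # Entries in LD_LIBRARY_PATH are separated by ":" on Linux.
    env["LD_LIBRARY_PATH"] = lib_dir + (":" + current if current else "")
    return env


# Hypothetical install location of the Bio++ shared libraries.
BPP_LIB_DIR = "/path/to/bpp/lib"

env = env_with_library_path(BPP_LIB_DIR, base_env={"LD_LIBRARY_PATH": "/usr/local/lib"})
print(env["LD_LIBRARY_PATH"])  # /path/to/bpp/lib:/usr/local/lib
```

The resulting dictionary can be passed as the env argument of subprocess.run when launching the binary. If the libraries are installed somewhere the dynamic linker already searches, none of this should be necessary.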

yozeng commented 8 months ago

Ok, thank you very much for your patience! I'm going to try running autocoalhmm after installing the Bio++ libraries you mentioned above. Sincerely thank you!

rivasiker / autocoalhmm

some questions about the scripts in the pipeline #8