Debug v.0.4.2 with example data

Hi Jeff,

With your updated documentation, it is much easier to repeat the example analysis using knock-knock. Below I summarize the bugs, issues, and concerns I found while exploring knock-knock with the example data (PacBio) and the latest documentation (commit: 0c347a33e24754e1d564348cf82b8ff1308bd92a). I hope it saves you some time when you update knock-knock.

Installation: setup.py requires "hits v0.4.1", which is not available from PyPI: https://github.com/jeffhussmann/knock-knock/blob/0c347a33e24754e1d564348cf82b8ff1308bd92a/setup.py#L53

Install example dataset:

Command used:

knock-knock install_example_data PROJECT_DIR

Issue1: the command in the documentation is out of date (it uses underscores); the correct command uses hyphens:
```
knock-knock install-example-data PROJECT_DIR
```

Issue2: the following missing libraries need to be installed first:

conda install -c anaconda seaborn
conda install -c anaconda scipy
conda install -c anaconda statsmodels
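
Before running the install command, it may help to check which of these libraries are already importable. The following is a minimal sketch, not part of knock-knock; the helper name `find_missing` is hypothetical, and the package list is just the three libraries named above.

```python
import importlib.util

# The three libraries this report found missing (names taken from the
# conda commands above).
REQUIRED_PACKAGES = ["seaborn", "scipy", "statsmodels"]


def find_missing(packages):
    """Return the names in `packages` that cannot be imported here."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Anything reported as missing can then be installed with the conda commands listed above.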

Build targets:
- Command used:
```
knock-knock build_targets PROJECT_DIR
```
- Issue1: the command in the documentation is out of date (it uses underscores); the correct command uses hyphens:
```
knock-knock build-targets PROJECT_DIR
```
The parallel command
- Command used:
```
knock-knock parallel PROJECT_DIR 4 --group pacbio
```
- Error: AttributeError: 'PacbioExperiment' object has no attribute 'generate_alignments'. Did you mean: 'get_read_alignments'?
- So I decided to test the knock-knock process command first. It hit the same error, so I then tested each --stage separately.
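
Testing the stages one at a time can also be scripted. The sketch below is only an illustration under stated assumptions: the helper names `stage_command` and `run_stages` are hypothetical, the stage names are the ones used in the commands in this report, and it assumes the knock-knock process exits with a non-zero status when a stage raises an exception. A call such as `run_stages("PROJECT_DIR", "pacbio", "R_PCR")` would return the name of the first failing stage, or `None` if every stage succeeds.

```python
import subprocess

# Stage names as they appear in the commands in this report.
STAGES = ["preprocess", "align", "categorize", "generate_figure"]


def stage_command(project_dir, group, sample, stage):
    """Build the `knock-knock process` command line for a single stage."""
    return ["knock-knock", "process", project_dir, group, sample, "--stage", stage]


def run_stages(project_dir, group, sample, run=subprocess.run):
    """Run the stages in order and stop at the first one that fails.

    `run` is injectable so the logic can be tried without knock-knock
    installed. Returns the failing stage name, or None if all succeeded.
    """
    for stage in STAGES:
        result = run(stage_command(project_dir, group, sample, stage))
        # Assumption: an uncaught exception gives a non-zero exit status.
        if result.returncode != 0:
            return stage
    return None
```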
The process --stage preprocess command: works well
The process --stage align command:
- Command used:
```
knock-knock process PROJECT_DIR pacbio R_PCR --stage align
```
- Error1: missing function "generate_alignments"
- Solution: use "self.generate_alignments_with_blast" instead on the following line: https://github.com/jeffhussmann/knock-knock/blob/0c347a33e24754e1d564348cf82b8ff1308bd92a/knock_knock/pacbio_experiment.py#L112
- Error2: KeyError: 'e_coli'
- Solution: the documentation does not mention that, for this dataset, the following command must be run first to build indices for the supplementary genome e_coli:
```
knock-knock build-indices PROJECT_DIR e_coli
```
The process --stage categorize command:
- Command used:
```
knock-knock process PROJECT_DIR pacbio R_PCR --stage categorize
```
- Error: 'PacbioExperiment' object has no attribute 'uncommon_read_type'
- Solution: add self.uncommon_read_type = 'CCS' into "pacbio_experiment.py"
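
To make the failure and the workaround concrete, here is a toy reproduction. Both classes are hypothetical stand-ins, not the real `PacbioExperiment`, and placing the assignment in `__init__` is an assumption; the report only says the line was added to "pacbio_experiment.py".

```python
class ExperimentWithoutAttribute:
    """Hypothetical stand-in that never defines uncommon_read_type."""


class ExperimentWithAttribute:
    """Hypothetical stand-in with the workaround applied."""

    def __init__(self):
        # Value taken from this report; the placement in __init__ is assumed.
        self.uncommon_read_type = "CCS"


def read_type(exp):
    """Stand-in for whatever code looks the attribute up during categorize."""
    return exp.uncommon_read_type


try:
    read_type(ExperimentWithoutAttribute())
except AttributeError as error:
    print("fails:", error)

print("works:", read_type(ExperimentWithAttribute()))
```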
The process --stage generate_figure command:
- Command used:
```
knock-knock process PROJECT_DIR pacbio R_PCR --stage generate_figure
```
- Error1: AttributeError: 'PacbioExperiment' object has no attribute
- Solution: add self.preprocessed_read_type = 'CCS' to "pacbio_experiment.py"
- Error2: AttributeError: 'TargetInfo' object has no attribute 'all_supplemental_reference_names'. Did you mean: 'supplemental_reference_sequences'?
- Solution: add self.all_supplemental_reference_names = [] to "target_info.py"
- Error3: ValueError: 'hg38_chr15' is not in list
- Concern: this comes from "architecture.py": the following line appears to look up a reference_name that is not in the reference_order object: https://github.com/jeffhussmann/knock-knock/blob/0c347a33e24754e1d564348cf82b8ff1308bd92a/knock_knock/visualize/architecture.py#L601
- Partial solution: add a filtering step before that line. I am not sure whether this is a legitimate fix. The code is: alignments = copy.deepcopy([al for al in alignments if al.reference_name in reference_order])
- Error4: AttributeError: 'PacbioExperiment' object has no attribute 'batch'
- Solution: change batch to batch_name in "experiment.py": https://github.com/jeffhussmann/knock-knock/blob/0c347a33e24754e1d564348cf82b8ff1308bd92a/knock_knock/experiment.py#L1180
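
The filtering workaround for Error3 can be illustrated outside knock-knock. In the sketch below, `FakeAlignment` and the contents of `reference_order` are hypothetical stand-ins chosen to mimic the error message. It assumes the failing lookup behaves like Python's `list.index`, which is consistent with the message but not confirmed against the linked line. Note the trade-off: filtering silently drops alignments to any reference missing from `reference_order`, so they would no longer be considered downstream.

```python
import copy
from dataclasses import dataclass


@dataclass
class FakeAlignment:
    """Hypothetical stand-in; only reference_name matters for this sketch."""
    reference_name: str


# Hypothetical values chosen to mimic the error in this report.
reference_order = ["target", "donor"]
alignments = [
    FakeAlignment("target"),
    FakeAlignment("hg38_chr15"),
    FakeAlignment("donor"),
]

# Looking up a name that is not in the list raises ValueError.
try:
    reference_order.index("hg38_chr15")
except ValueError as error:
    print("lookup failed:", error)

# The partial fix from this report: keep only alignments whose reference
# name is present in reference_order, then copy them.
filtered = copy.deepcopy(
    [al for al in alignments if al.reference_name in reference_order]
)
print([al.reference_name for al in filtered])  # prints ['target', 'donor']
```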
The knock-knock table command:
- Issue1: the command in the documentation should be knock-knock table PROJECT_DIR, not knock-knock table BASE_DIR
- Concern1: although I am testing with the PacBio dataset only, this command assumes that both Illumina and PacBio results are in place. I had to remove the data/illumina folder to stop this command from complaining.
- Error1: missing "exp.batch"
- Solution: change "exp.batch" to "exp.batch_name" in the "experiment.py" script
- Error2: AttributeError: 'PacbioExperiment' object has no attribute 'experiment_group'
- Solution: I worked around this by commenting out the corresponding lines in "table.py"; I am not sure whether this is a legitimate fix: https://github.com/jeffhussmann/knock-knock/blob/0c347a33e24754e1d564348cf82b8ff1308bd92a/knock_knock/table.py#L672-L673

After the above steps, I can successfully generate results from the example dataset. They are saved here: https://www.dropbox.com/sh/21n95nh0quvom4i/AACjjtxzC3iXeXoFP-lIXPEra?dl=0. Could you take a look and let me know whether the results look correct to you?

Another concern: I noticed that some genomes are hard-coded in "layout.py" (one example is linked below). Does this mean that other custom genomes are not supported? https://github.com/jeffhussmann/knock-knock/blob/0c347a33e24754e1d564348cf82b8ff1308bd92a/knock_knock/layout.py#L426-L431C15

Thank you for looking into this. If you are willing to review, I am happy to create a pull request with all the changes. Let me know.

--Kai
