The package implements a pipeline consisting of a read preprocessing module followed by a visualization module. The preprocessing module takes raw reads (FASTQ) from a pooled multi-sample Oxford Nanopore sequencing run as input. Reads are demultiplexed into sample-specific FASTQs using Grepseq information.
[outline picture]
The individual pipeline steps are:
After all of the above steps have completed, the results can be viewed in IGV.
We recommend using Anaconda to install all dependencies.
Install conda, using Miniconda2 as an example:
wget https://repo.anaconda.com/miniconda/Miniconda2-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b
rm ~/miniconda.sh
source ~/.bashrc
conda --help
Use conda to install all dependencies:
Make a conda virtual environment for greporeseq.
conda create -n greporeseq
Activate the conda greporeseq environment.
conda activate greporeseq
Install the greporeseq dependencies by entering the greporeseq directory and running:
conda install --file requirements.txt -y
To run the full greporeseq analysis pipeline, you must first create two manifest YAML files that describe all pipeline inputs. Once you have done so, you can simply run
python /path/to/greporeseq.py all -n ***.fastq.gz -r ***RefInfo.yaml -d ***DemultiplexInfo.yaml
to run the entire pipeline. The sections below explain how to write the manifest files.
To run an example on our abridged test data, use the following command:
python ./greporeseq/greporeseq.py all -n ./test/N26chop.fastq.gz -r ./test/N26RefInfo.yaml -d ./test/N26DemultiplexInfo.yaml
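If the pipeline is launched from another Python script, the same command can be assembled as an argument list. The following is only a sketch: the helper name `build_greporeseq_command` is hypothetical, and the only options it uses are the `all` subcommand and the `-n`, `-r`, and `-d` flags shown above.

```python
import subprocess
import sys


def build_greporeseq_command(script, fastq, ref_yaml, demux_yaml):
    """Return the argument list for the `all` subcommand shown above."""
    return [sys.executable, script, "all",
            "-n", fastq, "-r", ref_yaml, "-d", demux_yaml]


# Paths are the test-data paths from the example command.
cmd = build_greporeseq_command(
    "./greporeseq/greporeseq.py",
    "./test/N26chop.fastq.gz",
    "./test/N26RefInfo.yaml",
    "./test/N26DemultiplexInfo.yaml",
)
print(" ".join(cmd[1:]))

# To actually start the run from the repository root, uncomment:
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.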
When running all steps of greporeseq, each sample that needs to be demultiplexed must be described. A YAML file stores all of this input information in a form that greporeseq can read.
The fields contained in DemultiplexInfo.yaml are:
An example DemultiplexInfo.yaml:
[SAMPLE_ID]:
  reference_id: 4kAAVS1
  left_150bp: tgcaaacaggaagtgaacggggaagggagggggcttctcatctgggtgcgggaaccccacatggtacctgttagacacggcaaaacccccgtcaccacccacaggtggcgcttccagtgctcagactagggaagaggttccagcccctcc
  right_150bp: aaggagacaaagtccaggaccggctggaggggctcaacatcggaagaggggaagtcgagggagggatggtaaggaggactgcatgggtcagcacaggctgccaaagccagggccagttaaagcgactccaatgcggaagagagtaggtcg
  BCprimer_F: atgataactaggTGCAAACAGGAAGTGAACGG
  BClen_F: 12
  #optional
  BCprimer_R:
  BClen_R:
  unique_sequence:
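In the example above, the first 12 characters of BCprimer_F are lowercase and BClen_F is 12, which suggests that BClen_F gives the length of the barcode at the start of the barcoded primer. Assuming that reading is correct, the two parts can be separated as in the sketch below; the function name `split_barcoded_primer` is hypothetical and is not part of greporeseq.

```python
def split_barcoded_primer(bc_primer, bc_len):
    """Split a barcoded primer into (barcode, primer) at position bc_len.

    Assumes the barcode occupies the first bc_len bases of bc_primer.
    """
    if not 0 < bc_len < len(bc_primer):
        raise ValueError("bc_len must be between 1 and len(bc_primer) - 1")
    return bc_primer[:bc_len], bc_primer[bc_len:]


# Values taken from the example DemultiplexInfo.yaml entry.
barcode, primer = split_barcoded_primer("atgataactaggTGCAAACAGGAAGTGAACGG", 12)
print(barcode)  # atgataactagg
print(primer)   # TGCAAACAGGAAGTGAACGG
```

A quick check like this helps catch a BClen_F value that does not match the barcode actually written into BCprimer_F.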
The field contained in RefInfo.yaml:
sequence: The reference sequence. Note: barcode (BC) sequences are usually not included in the reference sequence.
An example RefInfo.yaml:
[REFERENCE_ID]:
  sequence: [REFERENCE_SEQUENCE]
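Because every DemultiplexInfo.yaml entry points at a RefInfo.yaml entry through reference_id, it can be worth cross-checking the two manifests before a long run. The sketch below works on plain dictionaries, such as those PyYAML's `yaml.safe_load` would return for correctly indented manifests. It assumes that left_150bp and right_150bp are expected to occur inside the referenced sequence; that is an inference from the field names, not documented behaviour, and `check_manifests` is a hypothetical helper.

```python
def check_manifests(demux, refs):
    """Return a list of human-readable problems found in the two manifests."""
    problems = []
    for sample_id, fields in demux.items():
        ref_id = fields.get("reference_id")
        if ref_id not in refs:
            problems.append(f"{sample_id}: reference_id {ref_id!r} not in RefInfo")
            continue
        ref_seq = refs[ref_id]["sequence"].lower()
        for key in ("left_150bp", "right_150bp"):
            # Empty YAML values load as None, so fall back to "".
            flank = (fields.get(key) or "").lower()
            if not flank:
                problems.append(f"{sample_id}: {key} is missing")
            elif flank not in ref_seq:
                problems.append(f"{sample_id}: {key} not found in reference {ref_id}")
    return problems


# Toy data (not real sequences) standing in for the parsed YAML files.
refs = {"toyRef": {"sequence": "ACGTACGTTTGACCA"}}
demux = {
    "sampleA": {"reference_id": "toyRef",
                "left_150bp": "acgtacg", "right_150bp": "ttgacca"},
    "sampleB": {"reference_id": "missingRef",
                "left_150bp": "acgtacg", "right_150bp": "ttgacca"},
}
for problem in check_manifests(demux, refs):
    print(problem)  # sampleB: reference_id 'missingRef' not in RefInfo
```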
When many samples need to be analysed at once, entering the manifest YAML files manually can be very time consuming, so we provide an Excel sheet, GREPore-seq_Template.us.xlsx, to produce the manifest YAML files quickly. Fill in the appropriate columns with the input information. The last column of the sheet, the YAML column, contains formulas that automatically convert the information into YAML format. Copy and paste its contents into a .yaml file, removing the leading and trailing double quotes, to obtain a correctly formatted manifest YAML file.
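For users who prefer a script to a spreadsheet, the same conversion can be sketched in Python. This is an alternative to the template, not a description of how its formulas work. It assumes a CSV file whose columns are named after the DemultiplexInfo.yaml fields, plus a hypothetical `sample_id` column, and the function name `rows_to_demultiplex_yaml` is likewise made up for illustration.

```python
import csv
import io

REQUIRED = ["reference_id", "left_150bp", "right_150bp", "BCprimer_F", "BClen_F"]
OPTIONAL = ["BCprimer_R", "BClen_R", "unique_sequence"]


def rows_to_demultiplex_yaml(rows):
    """Render an iterable of row dicts as DemultiplexInfo.yaml text."""
    lines = []
    for row in rows:
        lines.append(f"{row['sample_id']}:")
        for key in REQUIRED:
            lines.append(f"  {key}: {row[key]}")
        lines.append("  #optional")
        for key in OPTIONAL:
            value = (row.get(key) or "").strip()
            # rstrip() drops the trailing space left by an empty value.
            lines.append(f"  {key}: {value}".rstrip())
    return "\n".join(lines) + "\n"


# Toy CSV content; in practice this would be read from a file.
csv_text = (
    "sample_id,reference_id,left_150bp,right_150bp,"
    "BCprimer_F,BClen_F,BCprimer_R,BClen_R,unique_sequence\n"
    "sampleA,toyRef,acgt,ttga,atgataactaggTGCAAACAGGAAGTGAACGG,12,,,\n"
)
yaml_text = rows_to_demultiplex_yaml(csv.DictReader(io.StringIO(csv_text)))
print(yaml_text)
```

The output can be written straight to a .yaml file, so no quotes need to be removed afterwards.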
Below are examples of complete manifest files.
DemultiplexInfo.yaml:
N26_WW1590_sg25_4kBC26_2020_06_19_853T4d_wt_4d:
  reference_id: 4kAAVS1
  left_150bp: tgcaaacaggaagtgaacggggaagggagggggcttctcatctgggtgcgggaaccccacatggtacctgttagacacggcaaaacccccgtcaccacccacaggtggcgcttccagtgctcagactagggaagaggttccagcccctcc
  right_150bp: aaggagacaaagtccaggaccggctggaggggctcaacatcggaagaggggaagtcgagggagggatggtaaggaggactgcatgggtcagcacaggctgccaaagccagggccagttaaagcgactccaatgcggaagagagtaggtcg
  BCprimer_F: atgataactaggTGCAAACAGGAAGTGAACGG
  BClen_F: 12
  #optional
  BCprimer_R:
  BClen_R:
  unique_sequence:
N26_WW1591_sg25_4kBC27_2020_06_19_853T4d_RNPsyn25_4d:
  reference_id: 4kAAVS1
  left_150bp: tgcaaacaggaagtgaacggggaagggagggggcttctcatctgggtgcgggaaccccacatggtacctgttagacacggcaaaacccccgtcaccacccacaggtggcgcttccagtgctcagactagggaagaggttccagcccctcc
  right_150bp: aaggagacaaagtccaggaccggctggaggggctcaacatcggaagaggggaagtcgagggagggatggtaaggaggactgcatgggtcagcacaggctgccaaagccagggccagttaaagcgactccaatgcggaagagagtaggtcg
  BCprimer_F: catctcatctcgTGCAAACAGGAAGTGAACGG
  BClen_F: 12
  #optional
  BCprimer_R:
  BClen_R:
  unique_sequence:
RefInfo.yaml:
4kAAVS1:
sequence: tgcaaacaggaagtgaacggggaagggagggggcttctcatctgggtgcgggaaccccacatggtacctgttagacacggcaaaacccccgtcaccacccacaggtggcgcttccagtgctcagactagggaagaggttccagcccctcctccttcagagccaggagtcctggcccccagcccctcctgccttaaacccagccaggtccttccaagggtcaagctcggaaaccaccccagcagatactctgcaggaacgaagccgtgggcccagggctatgcagggtggaggaaggccaccctgtgctgggacagactcaggggcctgggcgggactcccagaggggtgagacagctgcacacctgtgtgcctgggccccaggctgtcacactccagttcactgaggccccctctgcacggggccctgcagccaggggctgacacgggccaccgtttctcattcttcccttaggggtccaaaacttggggggacaaaagccgaagtccagggggtcggaggagggacttgccccaggccttgtggacactgggtgggctccgggacctgaactggagctgaggaaggagtgaagctaaactcctagatccacgggataaattaccccccaagtccctcacctctccaaagctgcccatctggaggaggcgggagggagctacgagggccaagagcatgaggtcatggaaactcgggctgtgaaggggccgcacgtgccctgggaacgggatgaactcggctcgtttatttccacccagttgtcatggcgataggggaggggggcaaggagagcaatgggcctttccctttcaaggacctgcccagtacaggcatccctgtgaaagatgcctgaggcctgggcaccagggactccagagtccaggcccaacccctccccattcaacccaggaggccaggccccagcccttccgccctcagatgaaggagtccaggcccccagcctctccccattcagacccaggggtccaggcccagccccgcctccctaagacccagaagtccaggcccccagcccctcctccctcagacccacgagtccaggccccagcccctcctccctcggacccaggagtccaggcccccagtccctccaccctcagacccaggagtccaggccccagcccctcctccctcggacccaggagtccaggccccagcccctcctctctcaaacccaggagcccaggcccccagctcttctctgttcagccctaagaatcctggctccagcccctcctactctagcccccaaccccctagccactaaggcaattggggtgcaggaatgggggcagggtaccagcctcaccaagtggttgataaacccacgtggggtaccctaagaacttgggaacagccacagcaggggggcgatgcttggggacctgcctggagaaggatgcaggacgagaaacacagccccaggtggagaaactggccgggaatcaagagtcacccagagacagtgaccaaccatccctgttttcctaggactgagggtttcagtgctaaaactaggctgtcctgggcaaacagcataagctggtcaccccacacccagacctgacccaaacccagctcccctgcttcttggccacgtaacctgagaagggaatccctcctctctgaaccccagcccaccccaatgctccaggcctcctgggataccccgaagagtgagtttgccaagcagtcaccccacagttggaggagaatccacccaaaaggcagcctggtagacagggctggggtggcctctcgtggggtccaggccaagtaggtggcctggggcctctgggggatgcaggggaagggggatgcaggggaacggggatgcaggggaacggggctcagtctgaagagcagagccaggaacccctgtagggaaggggcaggagagccaggggcatgagatggtggacgaggaagggggacagggaagcctgagc
gcctctcctgggcttgccaaggactcaaacccagaagcccagagcagggccttagggaagcgggaccctgctctgggcggaggaatatgtcccagatagcactggggactctttaaggaaagaaggatggagaaagagaaagggagtagaggcggccacgacctggtgaacacctaggacgcaccattctcacaaagggagttttccacacggacacccccctcctcaccacagccctgccaggacggggctggctactggccttatctcacaggtaaaactgacgcacggaggaacaatataaattggggactagaaaggtgaagagccaaagttagaactcaggaccaacttattctgattttgtttttccaaactgcttctcctcttgggaagtgtaaggaagctgcagcaccaggatcagtgaaacgcaccagacggccgcgtcagagcagctcaggttctgggagagggtagcgcagggtggccactgagaaccgggcaggtcacgcatcccccccttccctcccaccccctgccaagctctccctcccaggatcctctctggctccatcgtaagcaaaccttagaggttctggcaaggagagagatggctccaggaaatgggggtgtgtcaccagataaggaatctgcctaacaggaggtgggggttagacccaatatcaggagactaggaaggaggaggcctaaggatggggcttttctgtcaccaatcctgtccctagtggccccactgtggggtggaggggacagataaaagtacccagaaccagagccacattaaccggccctgggaatataaggtggtcccagctcggggacacaggatccctggaggcagcaaacatgctgtcctgaagtggacataggggcccgggttggaggaagaagactagctgagctctcggacccctggaagatgccatgacagggggctggaagagctagcacagactagagaggtaaggggggtaggggagctgcccaaatgaaaggagtgagaggtgacccgaatccacaggagaacggggtgtccaggcaaagaaagcaagaggatggagaggtggctaaagccagggagacggggtactttggggttgtccagaaaaacggtgatgatgcaggcctacaagaaggggaggcgggacgcaagggagacatccgtcggagaaggccatcctaagaaacgagagatggcacaggccccagaaggagaaggaaaagggaacccagcgagtgaagacggcatggggttgggtgagggaggagagatgcccggagaggacccagacacggggaggatccgctcagaggacatcacgtggtgcagcgccgagaaggaagtgctccggaaagagcatccttgggcagcaacacagcagagagcaaggggaagagggagtggaggaagacggaacctgaaggaggcggcagggaaggatctgggccagccgtagaggtgacccaggccacaagctgcagacagaaagcggcacaggcccaggggagagaatgcaggtcagagaaagcaggacctgcctgggaaggggaaacagtgggccagaggcggcgcagaagccagtagagctcaaagtggtccggactcaggagagagacggcagcgttagagggcagagttccggcggcacagcaagggcactcgggggcgagaggagggcagcgcaaagtgacaatggccagggccaggcagatagaccagactgagctatgggagctggctcaggttcaggagagggcagggcagggaaggagacaaagtccaggaccggctggaggggctcaacatcggaagaggggaagtcgagggagggatggtaaggaggactgcatgggtcagcacaggctgccaaagccagggccagttaaagcgactccaatgcggaagagagtaggtcg
4kBCL11A37:
sequence: gtgtggtgttcggagtcctaagagcccccactagctcagaaatggacttagttgacctcccccattagcagcatggagagtcaaggagatgacttctaccttgccaaaggccttgggaagaaagacagcatcaaggtctcacacaacactccagggaggcagctgctgcccagtgctgtggacagcaaagcttcagtgcaggaaattaagattccccctgcctccccctcccccatcctcatcagcttggccatggcagggctgggggatcagaggtgaacaggaagcagaaggacccctgggggagacagggcctccagtgggaccagagctgagtggcctcaggcagtggcggaagctgattaaaggaaggtacggggagtggaggggaagtggacaaaagacaggacagccatcttagacaacaatgcaagggggagaaactgaagaaaacagaacagagaccactactggcaataaacagagagaaagtgaagccccatgggtgaggcacacctacattacttaagaaacctgagcacattcttacgcctagggcaataaatacatccttgagctacacaggctaagcaagagtgagagagggtgatgctgacaggccacatgggagagtgggaagacgtgggctgggagctgggagtttggcttctcatctgtgcatggcctctaaactgggcagtgaccatggcctggtcacctccccactctggacctgggttgcccctctgtaaacaaggaggttgtaataaattatctccaataccctaatgtcttataaatcttatgcaatttttgccaagatgggagtatggggagagaagagtggaaacggcccagagctcagtgagatgagatatcaaaggggacgaaaagtgttcattccatctccctaatctccaattggcaaagccagacttggggcaatacagactggttctgtgatgacaaataactcctagctcattcctaatgatttatcaccaaatgttctttcttcagctggaatttaaaatatggactcatccgtaaaataggaataataatagtatatgcttcatagggtttgtatgaaaataaaatgagtgcgtatttgtaaagttcctagagcagagtaagtgctccgagcttgtgaactaaaatgctgcctcctggtatttattagttacacctcagcagaaacaaagttatcaggccctttccccaattcctagtttgggtcagaagaaaagggaaaagggagaggaaaaaggaaaagaatatgacgtcagggggaggcaagtcagttgggaacacagatcctaacacagtagctggtacctgataggtgcctatatgtgatggatgggtggacagcccgacagatgaaaaatggacaattatgaggaggggagagtgcagacaggggaagcttcacctcctttacaattttgggagtccacacggcatggcatacaaattatttcattcccattgagaaataaaatccaattctccatcaccaagagagccttccgaaagaggcccccctgggcaaacggccaccgatggagaggtctgccagtcctcttctaccccacccacgcccccaccctaatcagaggccaaacccttcctggagcctGTGATAAAAGCAACTGTTAGcttgcactagactagcttcaaagttgtattgaccctggtgtgttatgtctaagagtagatgccatatctcttttctggcctatgttattacctgtatggactttgcactggaatcagctatctgctcttacttatgcacacctggggcatagagccagccctgtatcgcttttcagccatctcactacagataactcccaagtcctgtctagctgccttccttatcacaggaatagcacccaaggtccatcagtacctcagagtagaaccccctataaactagtctggtttgcccatggggcacagtcaggctgttttccagggtggg
gtgcagacattctctgcctgttgtgatgcttacatataacgtcataacagacacacgtatgtgttgtgatccctgtggtttgagagtttggagcttccctaaaagtcaaaatattctcaatgggccctcaatcagcacatacacacaaaaggtacctggaaaactgtaattcttttcctgctcaaagacaggcaattcaataccccttcccccaaccaaaaacccttgccaccatgggagcctggggcagagaaggcacagtgaagtcaaactgtaattccaggctctaaatggtgctgtcatttttctgagagtctctaaattacaagggtgttttcactattcttagctattttttaaaacacctaagaaacatactgcagctctggaaaagagaacaaacaaaccaaagagaagggatccagaggtcaccctcatatgtgaaaagtcaattgataatgaaggctttaggataaccggaggggagatgattgaaagcaatgcacctgtgcaggaaatggattacggaaacagggaattgttcatgaaatcccagaaaaccagaaccgggaaagttctggaagtcggaaaaacaaatcatgacttaagcaatggaagtccaatacacgtttacagaatgccttgtcccacgaggcaacacaggctaccacagatgggggacagggtgggagtggaccatcccagtggtgttactgaggggcaaagggatagccctatgaggcaagtgtccagggcagaactggagctttgtgaaaccatttcccaggcagagacagagcactaggctggtgctgccagtctgacaataagtctgccattgtcctctggtcagctctggacacacagcaaaagtgagttcagagtagcctgaagcaggaaagagggaagagaggaggataacacctatcttccactttgctgcaggttcaaggcaaggatttgagacagttaccccttctggaagagcctggtgagtacatctctcctgccttgtacaaccctctctcctcaccgactttctctcccagcagccagcaggggcctgggccatttatggaatgcaagccctgaccacacagacttacttacatgccaggacagccaccaggtagcctttcccactctaggttccactgtgagtgctctctctctctctctctctcactatgctcccaagaggagtcttacatcaaccccttcctcaaatctccctcactggatgtcacagtcataggcctgaaaagcagcatgcaaactgaatttttgtaaagcaggacccatttccccatggacagtcataagagatgagtgaacacaatgtagcacttaatttctgtcttcacgattacttcacgataaatctggattccaaagggactataagctctcacatggaaggaagcaagatctctactcctcccccagtgttgagtggacagggagtacaccgcagacacctgttggccaaccaattctaattccctttagctagcatcccctaagctagagctagagctagagctatttccttgcagccttccttttctctagcaaagtccttccatgcagtagctaatgacctgtaaacacttaatgagctagagaaacattccattgaaaggaataccactgtgcatccttttgtaaagaggggggaaaatcttttgtaaaacgaagcatcgcctttaactgctctgtttgatcaagtcagatttttcagaatatgaatagctagtattcaagcatatatgaactgtctttaagttaatcaatccctagaaactagccctcaggttagcaggccaaggatatatgagagtgctttgaagtctagacttaaactgccgctcct
When running the full pipeline, the results of each step are written to a separate folder. The output folders and their respective contents are as follows:
.sorted.bam and .bai for each demultiplexed FASTQ file.